
ENH: GLS fix for very large models #1176

Closed
bashtage wants to merge 2 commits into statsmodels:master from bashtage:gls-large-data-size

Conversation

@bashtage (Member) commented Nov 4, 2013

GLS can now handle very large models with diagonal sigma matrices. This was done by removing the need for a full nobs by nobs covariance matrix when sigma is given as a scalar or vector.

Test coverage was added for a large model with 100,000 observations, which would require more than 64GB of main memory under the previous GLS code.

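A minimal sketch of the idea, using a hypothetical whiten_diagonal helper rather than the actual statsmodels code: when sigma is a scalar or a length-nobs vector, whitening is elementwise, so the dense nobs by nobs cholsigmainv never needs to be built.

    import numpy as np

    def whiten_diagonal(X, sigma):
        # sigma is a scalar variance or a length-nobs vector of variances
        # (the diagonal of Sigma), so cholsigmainv is just the diagonal of
        # Sigma**-0.5 and never needs to be a 2-d array.
        sigma = np.asarray(sigma, dtype=float)
        cholsigmainv = 1.0 / np.sqrt(sigma)
        if sigma.ndim == 0 or X.ndim == 1:
            return X * cholsigmainv           # scalar sigma or 1-d X: plain broadcast
        return X * cholsigmainv[:, None]      # 2-d X: rescale each row

    # O(nobs * k) work and O(nobs) memory, versus roughly 75GB just to
    # store a dense 100,000 x 100,000 float64 covariance matrix.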
@coveralls commented:

Coverage remained the same when pulling 8f06db1 on bashtage:gls-large-data-size into 831e2ad on statsmodels:master.

        return np.dot(self.cholsigmainv, X)
    else:
        if X.ndim == 2:
            return X * self.cholsigmainv[:, None]
Member (inline comment on the diff):

This should be np.dot(self.cholsigmainv, X) as before? (We are not using np.matrix, and * is always elementwise.)

Member Author:

cholsigmainv is no longer an nobs by nobs array but a length-nobs array, so element-by-element multiplication with broadcasting is needed.

When nobs is very large and sigma is a vector or scalar, creating an nobs by nobs array is problematic. For example, when nobs=100000 this array requires roughly 75GB of memory.

This change avoids that penalty. It is also algorithmically superior, since multiplying by a dense diagonal nobs by nobs matrix requires O(nobs**2) operations while the broadcast is O(nobs).

Or maybe I am misunderstanding?
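A quick check on toy data (hypothetical names, just for illustration) that the broadcast form matches the dense diagonal product:

    import numpy as np

    rng = np.random.RandomState(0)
    nobs, k = 5, 3
    d = rng.rand(nobs)                  # stands in for a 1-d cholsigmainv
    X = rng.rand(nobs, k)

    dense = np.dot(np.diag(d), X)       # builds an nobs x nobs matrix first
    broadcast = d[:, None] * X          # O(nobs * k), no dense matrix
    assert np.allclose(dense, broadcast)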

Member:

My mistake, I didn't read the first if.

Would elif self.sigma.ndim == 2: on the outer if be clearer to understand?

@josef-pkt (Member) commented:

We don't have unit tests for this, since the GLS subclasses implement their own, more efficient whiten. But the change looks good, and if something else shows up we can fix it there as well.

(I don't remember whether we decided at some point to ignore GLS when it is equivalent to WLS.)

Related: the more general solution that I thought about but only tried out in a few experiments is that cholsigmainv could be just a linear operator that knows how to calculate a dot product. This would be useful when we can directly specify the cholsigmainv transformation as a sparse matrix, e.g. for AR(p) or cluster/panel correlation.
That's similar to how whiten is currently used, but we could push this down one level and build up a sparse cholsigmainv directly, instead of going through defining a possibly dense sigma.
(We don't have a collection of cholsigmainv for different covariance patterns yet.)
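A sketch of that idea for the AR(1) case, using scipy's LinearOperator; ar1_whitening_operator is a hypothetical helper, not existing statsmodels or scipy API:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator

    def ar1_whitening_operator(rho, nobs):
        # Applies cholsigmainv for an AR(1) process: the Prais-Winsten
        # transform sqrt(1 - rho**2) * x[0] for the first observation and
        # x[t] - rho * x[t-1] for the rest, in O(nobs) time, without ever
        # forming an nobs x nobs matrix.
        def matvec(x):
            out = np.empty(nobs)
            out[0] = np.sqrt(1.0 - rho ** 2) * x[0]
            out[1:] = x[1:] - rho * x[:-1]
            return out
        return LinearOperator((nobs, nobs), matvec=matvec)

    op = ar1_whitening_operator(0.5, 4)
    print(op.dot(np.ones(4)))           # whitened vector, no dense matrix built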

@bashtage (Member Author) commented Nov 4, 2013

I also refactored GLS.whiten to handle this case.

Structured inverses deserve their own treatment. For example, inverting a Toeplitz matrix can be done in O(nobs) operations rather than the O(nobs**3) needed for a dense matrix. Similar tricks are possible for other banded or blocked matrices.

Removed one level of if statement in GLS.whiten and reordered the branches from lowest to highest dimension.
@coveralls commented:

Coverage remained the same when pulling 188956b on bashtage:gls-large-data-size into 831e2ad on statsmodels:master.

@argriffing commented:

For example, inverting a Toeplitz matrix can be done in O(nobs) operations rather than O(nobs**3) for a dense matrix.

If nobs is the order of the matrix, then I think this should be O(nobs**2) (Levinson recursion, http://en.wikipedia.org/wiki/Levinson_recursion) rather than O(nobs). On the other hand, 'toeplitz' might mean something different to econometricians than it does to everyone else, because my scipy Toeplitz inversion PR scipy/scipy#2754 did not go over particularly well. For example, 'toeplitz' might always mean narrow-banded Toeplitz in econometrics.
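For reference, scipy later gained scipy.linalg.solve_toeplitz, which implements the Levinson recursion; a small sketch of the O(nobs**2) versus O(nobs**3) distinction on toy data:

    import numpy as np
    from scipy.linalg import solve_toeplitz, toeplitz

    # Levinson recursion solves a Toeplitz system in O(nobs**2) operations,
    # versus O(nobs**3) for a generic dense solve.
    c = np.array([4.0, 2.0, 1.0, 0.5])   # first column of a symmetric Toeplitz matrix
    b = np.ones(4)
    x_fast = solve_toeplitz(c, b)
    x_dense = np.linalg.solve(toeplitz(c), b)
    assert np.allclose(x_fast, x_dense)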

@bashtage (Member Author) commented Nov 4, 2013

I'm being a bit loose because I was thinking of a structured Toeplitz matrix that comes from a finite-order AR model. That inversion is O(nobs), while I'm sure a general Toeplitz without further structure would be O(nobs**2).


@josef-pkt (Member) commented:

AR(p) is a good example where memory is the bigger concern: the covariance matrix sigma is dense even with a finite lag order, but the inverse of sigma (the precision matrix) and cholsigmainv are banded and sparse (see the sketch after this comment).

@argriffing The PR for Levinson-Durbin in scipy is fine IMO; it's just missing some returns to be useful to statsmodels.
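A rough sketch of that banded cholsigmainv for AR(p) as a sparse matrix; ar_cholsigmainv_sparse is a hypothetical helper, and the first p rows are left as identity here, ignoring the exact initial-condition scaling:

    import numpy as np
    from scipy import sparse

    def ar_cholsigmainv_sparse(phi, nobs):
        # Rows t >= p apply the AR filter y[t] - phi[0]*y[t-1] - ... -
        # phi[p-1]*y[t-p]; only p+1 diagonals are stored, so the matrix
        # has O(nobs * p) nonzeros instead of nobs**2 entries.
        p = len(phi)
        diagonals = [np.ones(nobs)]
        offsets = [0]
        for j, phi_j in enumerate(phi, start=1):
            band = np.full(nobs - j, -phi_j)
            band[: max(p - j, 0)] = 0.0   # keep the first p rows as identity
            diagonals.append(band)
            offsets.append(-j)
        return sparse.diags(diagonals, offsets, format="csr")

    W = ar_cholsigmainv_sparse([0.5], nobs=5)   # AR(1) with phi = 0.5
    print(W.toarray())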

@argriffing commented:

The PR for Levinson-Durbin in scipy is fine IMO; it's just missing some returns to be useful to statsmodels.

I suspect those required returns would be related to a finite-order autoregressive model. In other words, I think it is not just a matter of returning a few more things that the algorithm in the PR is computing internally anyway, but rather that the extra returns you need for statsmodels are related to the particular structure of your matrices, which the Toeplitz solver does not assume.

class TestGLS_large_data(TestDataDimensions):
    @classmethod
    def setupClass(cls):
        nobs = 100000
Member (inline comment on the test):

Is there anything in the test that really uses this large nobs? We don't have a check for timing or memory consumption. To me it looks like the test would be unchanged if we just used nobs=1000, which would be faster.

Member Author:

I used 100k since this would break any reasonable computer under the old code, although it is slow. Aside from ensuring that large GLS causes a memory error in the old code, 1000 is fine for checking the result.

@josef-pkt (Member) commented:

This also looks good; no problem to merge.

It's a little behind master, but the branching in the network graph is straightforward.

@josef-pkt (Member) commented:

Merged after rebasing and fixing some merge conflicts, mainly in the slogdet part of loglike.

I reduced the sample size to 1000 in the tests; breaking a computer as a test sounds too "impolite" to me. But we should look into a memory profiler and see if we can reuse it for a test or for separate memory checks.

Thanks Kevin

@bashtage deleted the gls-large-data-size branch March 24, 2014 23:17
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014
@bashtage added the "rejected" label (used for PRs with changes that are not acceptable) Jul 27, 2020