ENH: GLS fix for very large models #1176
Conversation
GLS can now handle very large models with diagonal sigma matrices. This was done by removing the need for a full nobs by nobs covariance when sigma is input as a scalar or vector. Adds test coverage for a large model with 100,000 observations, which would require > 64GB of main memory in the previous GLS code.
        return np.dot(self.cholsigmainv, X)
    else:
        if X.ndim == 2:
            return X * self.cholsigmainv[:, None]
Shouldn't this be np.dot(self.cholsigmainv, X), as before? (We are not using np.matrix, and * is always elementwise.)
cholsigmainv is not an nobs by nobs array anymore, but a length-nobs array, so element-by-element multiplication with broadcasting is needed. When nobs is very large and sigma is a vector or scalar, creating an nobs by nobs array is problematic: for example, when nobs=100000 this array requires more than 64GB of memory. This change avoids that penalty. It is also algorithmically superior, since multiplying by a diagonal nobs by nobs array requires O(nobs**2) operations while the broadcast product is O(nobs).
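A quick numeric check of this claim (sizes hypothetical): with cholsigmainv stored as a length-nobs vector, the broadcast product matches multiplying by the equivalent diagonal matrix, without ever materializing the nobs by nobs array.

```python
import numpy as np

rng = np.random.default_rng(42)
nobs, k = 6, 2
X = rng.standard_normal((nobs, k))
d = rng.uniform(0.5, 2.0, size=nobs)  # cholsigmainv as a length-nobs vector

dense = np.dot(np.diag(d), X)  # materializes nobs x nobs: O(nobs**2) memory/flops
fast = X * d[:, None]          # broadcasting: O(nobs * k), no nobs x nobs array

assert np.allclose(dense, fast)
# For nobs = 100,000, np.diag(d) would hold 1e10 float64 values, about 80GB.
```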
Or maybe I am misunderstanding?
My mistake, I didn't read the first if. Wouldn't an

    elif self.sigma.ndim == 2:

on the higher if be clearer to understand, though?
We don't have unit tests for this, since the GLS subclasses implement their own more efficient whiten. (I don't remember whether we decided at some point to ignore GLS when it is equivalent to WLS.) Related:
I also refactored. Structured inverses deserve their own treatment: for example, inverting a Toeplitz matrix can be done in O(nobs) operations.
Removed one level of an if statement in GLS.whiten and reordered the branches from lowest to highest sigma dimension.
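A rough sketch of what the flattened whiten might look like after reordering the branches from lowest to highest sigma dimension (the function signature and exact branch structure here are my assumption, not the merged code):

```python
import numpy as np

def whiten(X, cholsigmainv, sigma_ndim):
    """Whiten X with cholsigmainv, branching on the dimension of sigma.

    sigma_ndim == 0: scalar sigma, cholsigmainv is a scalar.
    sigma_ndim == 1: diagonal sigma, cholsigmainv is a length-nobs vector.
    sigma_ndim == 2: full sigma, cholsigmainv is an nobs x nobs array.
    """
    if sigma_ndim == 0:
        return cholsigmainv * X
    elif sigma_ndim == 1:
        return X * cholsigmainv[:, None] if X.ndim == 2 else X * cholsigmainv
    else:
        return np.dot(cholsigmainv, X)
```

Ordering the branches by sigma's dimension keeps the cheap scalar and vector paths first and leaves the dense np.dot as the fallback.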
If nobs is the order of the matrix, then I think this should be O(nobs**2) (http://en.wikipedia.org/wiki/Levinson_recursion) rather than O(nobs). On the other hand, 'toeplitz' might mean something different to econometricians than it does to everyone else, because my scipy toeplitz inversion PR scipy/scipy#2754 did not go over particularly well. For example, 'toeplitz' might always mean narrow-banded Toeplitz in econometrics.

I'm being a bit loose, because I was thinking of the structured Toeplitz matrix that comes from a finite-order AR model. This inversion is O(nobs).
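For what it's worth, SciPy later gained a Levinson-based Toeplitz solver; a small illustrative check (sizes are arbitrary) that it agrees with a dense solve, the difference being O(nobs**2) versus O(nobs**3) work:

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

n = 50
c = 0.5 ** np.arange(n)  # first column of a symmetric Toeplitz matrix
b = np.ones(n)

x_levinson = solve_toeplitz(c, b)          # Levinson recursion: O(n**2)
x_dense = np.linalg.solve(toeplitz(c), b)  # dense LU: O(n**3)

assert np.allclose(x_levinson, x_dense)
```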
AR(p) is a good example where memory is more important: the covariance matrix sigma is dense even with a finite lag order, but the inverse of sigma (the precision matrix) and cholsigmainv are banded and sparse. @argriffing The PR for Levinson-Durbin in scipy is fine IMO; it's just missing some returns to be useful to statsmodels.
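A small demonstration of the banded-precision point for AR(1) (rho and the size are illustrative): sigma is dense, but its inverse is exactly tridiagonal and its Cholesky factor is bidiagonal.

```python
import numpy as np
from scipy.linalg import toeplitz

rho, n = 0.6, 8
# Dense AR(1) covariance: sigma[i, j] = rho**|i - j| / (1 - rho**2)
sigma = toeplitz(rho ** np.arange(n)) / (1 - rho ** 2)

sigmainv = np.linalg.inv(sigma)      # precision matrix: tridiagonal
chol = np.linalg.cholesky(sigmainv)  # lower bidiagonal

# Entries outside the first off-diagonal band are numerically zero.
assert np.abs(np.tril(sigmainv, -2)).max() < 1e-8
assert np.abs(np.triu(sigmainv, 2)).max() < 1e-8
assert np.abs(np.tril(chol, -2)).max() < 1e-8
```

For AR(p) the bands widen to p, but the storage stays O(nobs * p) rather than O(nobs**2).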
I suspect those required returns would be related to a finite-order autoregressive model. In other words, I think it is not just a matter of returning a few more things that the algorithm in the PR is computing internally anyway; rather, the extra returns you need for statsmodels are related to the particular structure of your matrices, which is not assumed by the Toeplitz solver.
class TestGLS_large_data(TestDataDimensions):
    @classmethod
    def setupClass(cls):
        nobs = 100000
Is there anything in the test that really uses this large nobs? We don't have a check for timing or memory consumption.
To me it looks like the test is unchanged if we just use nobs=1000, which would be faster
I used 100k since this would break any reasonable computer, although it is slow. Aside from ensuring that large GLS causes a memory error with the old code, 1000 is fine for checking the result.
Looks good too. No problem to merge; it's a little bit behind master, but straightforward branching in the network.
Merged after rebasing and fixing some merge conflicts, mainly in the slogdet of loglike. I reduced the sample size to 1000 in the tests; breaking a computer as a test sounds too "impolite" to me. Thanks Kevin.