Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add OLSVectorized, initial version #5382

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

josef-pkt
Copy link
Member

see #2203 adding special_linear_model
see #4771 adding vectorized OLS

this subclasses OLS and RegressionResults but only a subset of results will work

The core attributes are working correctly,
Among the methods I only checked conf_int and t_test so far.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 84.237% when pulling 04aa547 on josef-pkt:vectorized_ols into 14fb572 on statsmodels:master.

r"""results class for vectorized OLS
"""

@cache_writable()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why writable? This is used in very few places

return (wendog * wendog).sum(0)

def conf_int(self, alpha=.05, cols=None):
#print('using OLSVectorizedResults.conf_int')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete

Examples
--------
>>> import numpy as np
>>> import statsmodels.api as sm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no examples, remove section



class OLSVectorized(OLS):
_results_class = OLSVectorizedResults
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this is better than setting it in init, but to make this work you had to put the model class below the results class, contrary to the pattern. Prettier to follow statespace/regimeswitching precedent and make this a property.

res2_list = self.res2_list


attrs = ['params', 'scale', 'bse', 'tvalues', 'pvalues', 'ssr', 'centered_tss',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parametrize

@jbrockmendel
Copy link
Contributor

Big picture, what’s the difference between VectorizedOLS and SUR?

@josef-pkt
Copy link
Member Author

Big picture, what’s the difference between VectorizedOLS and SUR?

It's a special case with limited functionality that can be vectorized.
General SUR allows for different exog per equation and requires full kronecker or block diagonal long form.
SUR with common exog as in VAR still includes cov_resid and joint inference over all parameters. This special case will be Multivariate OLS as used for MANOVA.

OLSVectorized assumes common exog, but assumes cov_resid is diagonal or irrelevant for inference and only provides inference within an equation but not for the joint params. That is, it is just a computational improvement over running many (univariate endog) OLS in a loop.

@josef-pkt
Copy link
Member Author

application residual bootstrap, exog is the same for all bootstrap samples

"The code implements the multiplier bootstrap efficiently by executing all 10,000 regressions simultaneously, exploiting the fact that the regressors are common across the bootstrap replications."

Bruce E. Hansen 2017: Regression Kink With an Unknown Threshold

bootstrap is used to get simulated p-values that are not pivotal

@josef-pkt
Copy link
Member Author

test for joint hypothesis wald_test f_test seems to be missing

@zoj613
Copy link

zoj613 commented Feb 9, 2022

I this still planned for a merge? @josef-pkt
I am faced with code like the following:

    import statsmodels.api as sm

    n_rows, n_col = some_array.shape
    X = sm.add_constant(temp)
    for j in range(n_cols):
        Y = some_array[:, j]
        model = sm.OLS(Y, X)
        results = model.fit()
       # append results to some  variable

This snippet is from a function that is called in another loop many times. It is very slow and I was wondering if there is another way I can speed this loop up?

@josef-pkt
Copy link
Member Author

josef-pkt commented Feb 9, 2022

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X y (which would require working in batches).
The main issue is what more is in your loop. What statistics do you need from your results?

@josef-pkt josef-pkt added this to the 0.14 milestone Feb 9, 2022
@zoj613
Copy link

zoj613 commented Feb 9, 2022

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X (which would require working in batches). The main issue is what more is in your loop. What statistics do you need from your results?

X is a 1d array whose size is equal to the number of rows of some_array. How could this be transformed into a single regression? The main statistics I want to extract are: params, rsquared and pvalues.

@josef-pkt
Copy link
Member Author

X is a 1d array

your don't have a constant in your regression?

the basic idea is the
params = pinv(X) @ y
computes the parameters for vectorized regression of each column of y on x.

The main problem is that the residual scale and standard errors are different for each y, which needs some workarounds to get, for example, pvalues.

@josef-pkt
Copy link
Member Author

If you have only one explanatory variable, then explicit formulas are likely faster. I only looked at multiple regression here.

@josef-pkt josef-pkt mentioned this pull request Feb 13, 2022
20 tasks
@pep8speaks
Copy link

Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 14:1: F401 'numpy.testing.assert_allclose' imported but unused

@zoj613
Copy link

zoj613 commented Feb 14, 2022

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X y (which would require working in batches). The main issue is what more is in your loop. What statistics do you need from your results?

Thanks for the suggestion. I was able to get a fast solution by just using the matrix OLS formulation and take advantage of numpy's broadcasting rules. The rsquared and pvalues were fairly easy to compute by hand using scipy's t distribution and formulae for the various sums of squares.

@josef-pkt
Copy link
Member Author

rebased
I found my original script when I wrote this.
The example runs, but I'm not sure about the t_test yet.

all main attributes correspond to the loop at rtol 1e-13, I had around with a failure when agreement was around 5e-13.

attrs = ['params', 'scale', 'bse', 'tvalues', 'pvalues', 'ssr', 'centered_tss',
         'rsquared', 'llf', 'aic', 'bic', 'fvalue', 'f_pvalue']

It should be easy to merge, in case we raise notimplementederror with wald_test and similar.

other methods, e.g. get_prediction ?
predict should work, maybe not get_prediction
we raise in get_influence

@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 1 alert when merging 5f5f3b3 into 152e27d - view on LGTM.com

new alerts:

  • 1 for Unused import

@josef-pkt
Copy link
Member Author

josef-pkt commented Feb 14, 2022

t_test has unit tests, so only joint hypothesis are not supported
t_test is largely copied code and adjusted for this case

no checks for predict yet

no pandas results wrapper. I might skip that here.

summary work but only has the params table.
diaplaying vectorized top table doesn't make sence
We could add a datframe with main results in rows, e.g. rsquared, llf

update

in my example:
res.predict works
res.get_prediction raises error on broadcasting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants