ENH: add OLSVectorized, initial version #5382

josef-pkt · 2018-11-14T03:02:14Z

see #2203 adding special_linear_model
see #4771 adding vectorized OLS

this subclasses OLS and RegressionResults but only a subset of results will work

The core attributes are working correctly,
Among the methods I only checked conf_int and t_test so far.

coveralls · 2018-11-14T04:23:47Z

Coverage increased (+0.02%) to 84.237% when pulling 04aa547 on josef-pkt:vectorized_ols into 14fb572 on statsmodels:master.

jbrockmendel · 2018-12-08T03:01:20Z

statsmodels/regression/special_linear_model.py

+    r"""results class for vectorized OLS
+    """
+
+    @cache_writable()


Why writable? This is used in very few places

jbrockmendel · 2018-12-08T03:01:45Z

statsmodels/regression/special_linear_model.py

+        return (wendog * wendog).sum(0)
+
+    def conf_int(self, alpha=.05, cols=None):
+        #print('using OLSVectorizedResults.conf_int')


statsmodels/regression/special_linear_model.py

jbrockmendel · 2018-12-08T03:02:57Z

statsmodels/regression/special_linear_model.py

+        Examples
+        --------
+        >>> import numpy as np
+        >>> import statsmodels.api as sm


If no examples, remove section

jbrockmendel · 2018-12-08T03:05:05Z

statsmodels/regression/special_linear_model.py

+
+
+class OLSVectorized(OLS):
+    _results_class = OLSVectorizedResults


Doing this is better than setting it in init, but to make this work you had to put the model class below the results class, contrary to the pattern. Prettier to follow statespace/regimeswitching precedent and make this a property.

statsmodels/regression/tests/test_special_linear_model.py

jbrockmendel · 2018-12-08T03:06:03Z

statsmodels/regression/tests/test_special_linear_model.py

+        res2_list = self.res2_list
+
+
+        attrs = ['params', 'scale', 'bse', 'tvalues', 'pvalues', 'ssr', 'centered_tss',


Parametrize

statsmodels/regression/tests/test_special_linear_model.py

statsmodels/regression/special_linear_model.py

jbrockmendel · 2018-12-08T03:08:08Z

Big picture, what’s the difference between VectorizedOLS and SUR?

josef-pkt · 2018-12-08T03:21:06Z

Big picture, what’s the difference between VectorizedOLS and SUR?

It's a special case with limited functionality that can be vectorized.
General SUR allows for different exog per equation and requires full kronecker or block diagonal long form.
SUR with common exog as in VAR still includes cov_resid and joint inference over all parameters. This special case will be Multivariate OLS as used for MANOVA.

OLSVectorized assumes common exog, but assumes cov_resid is diagonal or irrelevant for inference and only provides inference within an equation but not for the joint params. That is, it is just a computational improvement over running many (univariate endog) OLS in a loop.

josef-pkt · 2019-12-02T18:55:15Z

application residual bootstrap, exog is the same for all bootstrap samples

"The code implements the multiplier bootstrap efficiently by executing all 10,000 regressions simultaneously, exploiting the fact that the regressors are common across the bootstrap replications."

Bruce E. Hansen 2017: Regression Kink With an Unknown Threshold

bootstrap is used to get simulated p-values that are not pivotal

josef-pkt · 2021-01-26T17:07:14Z

test for joint hypothesis wald_test f_test seems to be missing

zoj613 · 2022-02-09T14:42:09Z

I this still planned for a merge? @josef-pkt
I am faced with code like the following:

    import statsmodels.api as sm

    n_rows, n_col = some_array.shape
    X = sm.add_constant(temp)
    for j in range(n_cols):
        Y = some_array[:, j]
        model = sm.OLS(Y, X)
        results = model.fit()
       # append results to some  variable

This snippet is from a function that is called in another loop many times. It is very slow and I was wondering if there is another way I can speed this loop up?

josef-pkt · 2022-02-09T14:49:54Z

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X y (which would require working in batches).
The main issue is what more is in your loop. What statistics do you need from your results?

zoj613 · 2022-02-09T14:54:29Z

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X (which would require working in batches). The main issue is what more is in your loop. What statistics do you need from your results?

X is a 1d array whose size is equal to the number of rows of some_array. How could this be transformed into a single regression? The main statistics I want to extract are: params, rsquared and pvalues.

josef-pkt · 2022-02-09T15:19:50Z

X is a 1d array

your don't have a constant in your regression?

the basic idea is the
params = pinv(X) @ y
computes the parameters for vectorized regression of each column of y on x.

The main problem is that the residual scale and standard errors are different for each y, which needs some workarounds to get, for example, pvalues.

josef-pkt · 2022-02-09T15:26:01Z

If you have only one explanatory variable, then explicit formulas are likely faster. I only looked at multiple regression here.

pep8speaks · 2022-02-14T20:43:53Z

Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file statsmodels/regression/special_linear_model.py:

Line 14:1: F401 'numpy.testing.assert_allclose' imported but unused

zoj613 · 2022-02-14T20:53:31Z

@zoj613 Yes, the plan is still to merge this.

The main open question is what should be in the results class, e.g. vectorized t-test.

Your loop can be done in a single regression which should be much faster unless there are memory problems with huge X y (which would require working in batches). The main issue is what more is in your loop. What statistics do you need from your results?

Thanks for the suggestion. I was able to get a fast solution by just using the matrix OLS formulation and take advantage of numpy's broadcasting rules. The rsquared and pvalues were fairly easy to compute by hand using scipy's t distribution and formulae for the various sums of squares.

josef-pkt · 2022-02-14T21:13:59Z

rebased
I found my original script when I wrote this.
The example runs, but I'm not sure about the t_test yet.

all main attributes correspond to the loop at rtol 1e-13, I had around with a failure when agreement was around 5e-13.

attrs = ['params', 'scale', 'bse', 'tvalues', 'pvalues', 'ssr', 'centered_tss',
         'rsquared', 'llf', 'aic', 'bic', 'fvalue', 'f_pvalue']

It should be easy to merge, in case we raise notimplementederror with wald_test and similar.

other methods, e.g. get_prediction ?
predict should work, maybe not get_prediction
we raise in get_influence

lgtm-com · 2022-02-14T21:24:19Z

This pull request introduces 1 alert when merging 5f5f3b3 into 152e27d - view on LGTM.com

new alerts:

1 for Unused import

josef-pkt · 2022-02-14T21:47:35Z

t_test has unit tests, so only joint hypothesis are not supported
t_test is largely copied code and adjusted for this case

no checks for predict yet

no pandas results wrapper. I might skip that here.

summary work but only has the params table.
diaplaying vectorized top table doesn't make sence
We could add a datframe with main results in rows, e.g. rsquared, llf

update

in my example:
res.predict works
res.get_prediction raises error on broadcasting

josef-pkt added type-enh comp-regression labels Nov 14, 2018

josef-pkt mentioned this pull request Nov 14, 2018

ENH: Vectorized OLS, simplified Multivariate Linear Regression using OLS #4771

Open

josef-pkt mentioned this pull request Dec 7, 2018

ENH: OLS/WLS based on summary statistic, (cov_data) #3901

Open

jbrockmendel reviewed Dec 8, 2018

View reviewed changes

statsmodels/regression/special_linear_model.py Show resolved Hide resolved

jbrockmendel reviewed Dec 8, 2018

View reviewed changes

statsmodels/regression/tests/test_special_linear_model.py Show resolved Hide resolved

jbrockmendel reviewed Dec 8, 2018

View reviewed changes

statsmodels/regression/tests/test_special_linear_model.py Show resolved Hide resolved

jbrockmendel reviewed Dec 8, 2018

View reviewed changes

statsmodels/regression/special_linear_model.py Show resolved Hide resolved

josef-pkt mentioned this pull request Aug 2, 2019

OLS compare_f_test in parallel #6049

Closed

bashtage mentioned this pull request Oct 17, 2019

Vectorized OLS #6201

Closed

josef-pkt added this to the 0.14 milestone Feb 9, 2022

josef-pkt mentioned this pull request Feb 13, 2022

SUMM roadmap 0.14 josef #7720

Open

20 tasks

ENH: add OLSVectorized, initial version

5f5f3b3

josef-pkt force-pushed the vectorized_ols branch from 04aa547 to 5f5f3b3 Compare February 14, 2022 20:43

josef-pkt mentioned this pull request Feb 17, 2022

ENH/Design vectorized hypothesis tests #6581

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add OLSVectorized, initial version #5382

ENH: add OLSVectorized, initial version #5382

josef-pkt commented Nov 14, 2018

coveralls commented Nov 14, 2018

jbrockmendel Dec 8, 2018

jbrockmendel Dec 8, 2018

jbrockmendel Dec 8, 2018

jbrockmendel Dec 8, 2018

jbrockmendel Dec 8, 2018

jbrockmendel commented Dec 8, 2018

josef-pkt commented Dec 8, 2018

josef-pkt commented Dec 2, 2019

josef-pkt commented Jan 26, 2021

zoj613 commented Feb 9, 2022

josef-pkt commented Feb 9, 2022 •

edited

zoj613 commented Feb 9, 2022

josef-pkt commented Feb 9, 2022

josef-pkt commented Feb 9, 2022

pep8speaks commented Feb 14, 2022

zoj613 commented Feb 14, 2022 •

edited

josef-pkt commented Feb 14, 2022

lgtm-com bot commented Feb 14, 2022

josef-pkt commented Feb 14, 2022 •

edited



		class OLSVectorized(OLS):
		_results_class = OLSVectorizedResults

		res2_list = self.res2_list


		attrs = ['params', 'scale', 'bse', 'tvalues', 'pvalues', 'ssr', 'centered_tss',

ENH: add OLSVectorized, initial version #5382

Are you sure you want to change the base?

ENH: add OLSVectorized, initial version #5382

Conversation

josef-pkt commented Nov 14, 2018

coveralls commented Nov 14, 2018

jbrockmendel Dec 8, 2018

Choose a reason for hiding this comment

jbrockmendel Dec 8, 2018

Choose a reason for hiding this comment

jbrockmendel Dec 8, 2018

Choose a reason for hiding this comment

jbrockmendel Dec 8, 2018

Choose a reason for hiding this comment

jbrockmendel Dec 8, 2018

Choose a reason for hiding this comment

jbrockmendel commented Dec 8, 2018

josef-pkt commented Dec 8, 2018

josef-pkt commented Dec 2, 2019

josef-pkt commented Jan 26, 2021

zoj613 commented Feb 9, 2022

josef-pkt commented Feb 9, 2022 • edited

zoj613 commented Feb 9, 2022

josef-pkt commented Feb 9, 2022

josef-pkt commented Feb 9, 2022

pep8speaks commented Feb 14, 2022

zoj613 commented Feb 14, 2022 • edited

josef-pkt commented Feb 14, 2022

lgtm-com bot commented Feb 14, 2022

josef-pkt commented Feb 14, 2022 • edited

josef-pkt commented Feb 9, 2022 •

edited

zoj613 commented Feb 14, 2022 •

edited

josef-pkt commented Feb 14, 2022 •

edited