PERF: Add a new factor that just computes beta. #2021
Conversation
I think the only significant decision that needs to be made here is what we want to do if there are NaNs in the input. The current implementation uses

cc @Jstauth @jmccorriston @ahgnaw @twiecki for thoughts on how we want to handle missing data here.

Note that I'm not trying to port our downstream shrinkage beta. (I think we should also do that, but the use-case here is just to provide the simplest and fastest reasonable implementation of beta, in particular with an eye toward use in portfolio optimization.)
This is a little over 1100x faster (!) than using `RollingLinearRegressionOfReturns` on my machine. Profiling output for a 1-month pipeline using both terms with a 90-day lookback:

```
Tue Nov 21 02:05:48 2017    pipebench/perf/betas.stats

         57724856 function calls (57689155 primitive calls) in 92.342 seconds

   Ordered by: cumulative time
   List reduced from 1212 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.612    0.029   95.461    4.546 statistical.py:194(compute)
   172201    0.407    0.000   94.843    0.001 statistical.py:201(regress)
       21    0.048    0.002    0.082    0.004 statistical.py:500(compute)
```
Force-pushed from dbe276b to 77cb3fd.
```
@@ -148,11 +155,6 @@ class RollingLinearRegression(CustomFactor, SingleInputMixin):
        The factor/slice whose columns are the predictor/independent variable
        of each regression with `dependent`. If `independent` is a Factor,
        regressions are computed asset-wise.
    independent : zipline.pipeline.Term with a numeric dtype
```
This is an unrelated change to the rest of the PR. We had this documented twice.
Hilarious performance note: in the current implementation more than half of the total time is spent constructing namedtuple types:
In

What is the current behavior of
Any asset with a NaN at any point in the lookback window ends up NaN.
Throw out observations that have nans in either array in ``SimpleBeta``.
@ssanderson Pending the TODO on missing data, this looks good to me.
```
class SimpleBeta(CustomFactor, StandardOutputs):
    """
    Factor producing the slope of a regression line between each asset's daily
    returns the daily returns of a single "target" asset.
```
"...each asset's daily returns and the daily returns of..." (missing "and")
Pushed a change that makes the behavior here match empyrical exactly around missing data. Handling NaNs imposes around a 2.5x performance penalty (from 0.9 seconds per year to 2.2 on my machine). That's still a huge win compared to 90+ seconds per month with the old implementation. We could probably claw back a good chunk of that performance by dropping into Cython for the nan-aware handling, but I think this is probably a good first cut if we're happy with empyrical's nan-handling behavior, which is to drop any observation in the regression where either the independent or dependent variable has missing data. Before:
After:
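A minimal numpy sketch of that nan-handling policy for the single-asset case (the helper name is hypothetical; this is not the PR's actual code): drop any observation where either the dependent or independent series is NaN, then regress on what remains.

```python
import numpy as np

def beta_dropping_nans(dependent, independent):
    """Slope of the OLS regression of ``dependent`` on ``independent``,
    ignoring rows where either input is NaN (the policy described above)."""
    valid = ~(np.isnan(dependent) | np.isnan(independent))
    x = independent[valid]
    y = dependent[valid]
    x_res = x - x.mean()
    # beta = Cov(X, Y) / Var(X)
    return (x_res * y).mean() / (x_res ** 2).mean()

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, np.nan, 10.0])
# Rows 2 and 3 are dropped; the surviving rows lie exactly on y = 2x,
# so the estimated beta is (approximately) 2.0.
print(beta_dropping_nans(y, x))
```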
Hrm, though, thinking about this a bit more, I think we need to require a minimum number of data points if we're handling nans in this way (as @ahgnaw suggests above). If we don't, we'll produce crazy values for regressions where we only have 2-3 data points.
Oh, that's not great. I like Ana's idea of a minimum number or minimum percentage of non-nan date pairs.
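To illustrate why a minimum-observations guard matters (a sketch with invented names, not the PR's implementation): two surviving points always fit a line exactly, so the slope is pure noise, and a minimum-observations check turns that into NaN instead.

```python
import numpy as np

def beta_with_minimum(dependent, independent, min_observations):
    """Nan-dropping beta that refuses to produce a value when too few
    valid observations remain. Illustrative only."""
    valid = ~(np.isnan(dependent) | np.isnan(independent))
    if valid.sum() < min_observations:
        return np.nan  # too few points to trust the estimate
    x = independent[valid]
    y = dependent[valid]
    x_res = x - x.mean()
    return (x_res * y).mean() / (x_res ** 2).mean()

# A 90-day window in which only two observations survived:
x = np.full(90, np.nan)
y = np.full(90, np.nan)
x[:2] = [0.01, 0.011]
y[:2] = [0.05, -0.2]

print(beta_with_minimum(y, x, min_observations=2))   # an extreme slope (~ -250)
print(beta_with_minimum(y, x, min_observations=10))  # nan: not enough data
```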
Make `SimpleBeta` produce values when there are missing returns observations. By default, we allow up to 25% of the returns observations to be missing, so that we can start producing betas earlier for recently IPO-ed stocks.
Asset type has a different repr in py3.
@ssanderson Let me know if this needs another round of review
```
def vectorized_beta(dependents, independent, allowed_missing, out=None):
```
I couldn't find another place where `vectorized_beta` might be useful at the moment, however at a later date we might want to move it to `empyrical`.
I don't think empyrical would use the vectorized part of this anywhere right now, but I suspect that the direct calculation of covariance and variance here is faster than what we currently do in empyrical, because it avoids making copies of the data and it avoids calculating parts of the covariance matrix that we don't need. I'm not super worried about porting this in the near term, because I don't think anything is currently bottlenecked on empyrical's beta calculations, but if that becomes a bottleneck this is probably a good place to look to speed things up.
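For reference, a simplified sketch of the direct covariance/variance approach being discussed, ignoring the missing-data handling (names are illustrative, not the PR's exact code): each per-column covariance is computed directly against the single independent column, so no full covariance matrix is ever formed.

```python
import numpy as np

def vectorized_beta_sketch(dependents, independent):
    """Compute one beta per column of ``dependents`` (shape (T, M))
    against a single ``independent`` column (shape (T, 1))."""
    # Center only the independent variable; centering the dependents
    # does not change the result (a standard covariance identity).
    ind_residual = independent - independent.mean(axis=0)
    # shape (M,): Cov(X, Y) for each dependent column, computed
    # directly rather than via a full covariance matrix.
    covariances = (ind_residual * dependents).mean(axis=0)
    # shape (1,): Var(X), shared by every column.
    independent_variance = (ind_residual ** 2).mean(axis=0)
    return covariances / independent_variance

t = np.linspace(0.0, 1.0, 100)
independent = t[:, None]
dependents = np.column_stack([2.0 * t, -3.0 * t + 1.0])
# The two columns are exact linear functions of the independent
# variable, so the betas come out (approximately) [2, -3].
print(vectorized_beta_sketch(dependents, independent))
```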
```
def test_allowed_missing_doesnt_double_count(self):
    # Test that allowed_missing only counts a row as missing one
    # observation if it's missing in both the dependent and independent
```
should "both" be "either"? Dates for which the dependent or independent (or both) data is nan cannot be included in the beta calculation.
I think this wording is correct. This test checks that if a row has a NaN in both the dependent and independent, then we only count that row as adding 1 toward the allowed number of missing values.
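A tiny illustration of that counting rule: combining the NaN masks with a logical OR means a row that is NaN in both inputs contributes only one missing observation, not two.

```python
import numpy as np

dependent   = np.array([1.0, np.nan, np.nan, 4.0])
independent = np.array([1.0, np.nan, 3.0,    np.nan])

# A row is unusable if either input is NaN, but a row that is NaN in
# BOTH inputs should only consume one unit of the allowed_missing
# budget.
missing = np.isnan(dependent) | np.isnan(independent)
print(int(missing.sum()))  # 3 -- row 1 is NaN in both but counted once
```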
@dmichalowicz yeah, I think this is ready for another pass if you can take another look.
@ssanderson Left some more comments. My main question is with how we "nan" out the independent variable but don't do the opposite for the dependent.
```
    window_length=2,
    mask=(AssetExists() | SingleAsset(asset=target)),
)
allowed_missing_count = int(np.floor(
```
Why do you need `np.floor`? Doesn't `int` already floor?
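For what it's worth, `int` truncates toward zero rather than flooring, so the two only differ for negative inputs; since the count here is nonnegative, they agree and the `np.floor` call is redundant in practice:

```python
import numpy as np

# int() truncates toward zero; np.floor rounds toward negative
# infinity. They agree on nonnegative values like a missing-data
# count, and differ only for negative inputs.
print(int(2.75), int(np.floor(2.75)))    # 2 2
print(int(-2.75), int(np.floor(-2.75)))  # -2 -3
```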
```
@expect_types(
    regression_length=int,
    target=Asset,
```
super nitpicky, but could you make these the same order as the function definition?
```
return "{}(window_length={}, allowed_missing={})".format(
    type(self).__name__,
    self.window_length,
    self.params['allowed_missing'],
```
Should this be "allowed_missing_count"?
```
    independent,
)

# Calculate beta as Cov(X, Y) / Cov(Y, Y).
```
I think you mean Cov(X, Y) / Cov(X, X) (The calculation below is correct, just this comment is off)
```
# shape: (M,)
independent_variances = nanmean(ind_residual ** 2, axis=0)

# shape: (M,)
```
Thanks for these shape comments!
tests/pipeline/test_statistical.py
```
# Sanity check that we actually inserted some nans.
self.assertTrue(np.count_nonzero(np.isnan(dependents)) > 0)
self.assertTrue(np.count_nonzero(np.isnan(independent)) > 0)
```
It's a very small chance, but doesn't this mean this test could fail randomly?
We're seeding the rng, so this is still deterministic.
tests/pipeline/test_statistical.py
```
assert_equal(np.isnan(result5),
             np.array([False, False, True, False, False]))

# With six allowed missing values, everything should produce a value.
```
With five allowed...
```
    isnan(dependents),
    nan,
    independent,
)
```
Should you do a similar operation on the `dependent` matrix? Otherwise it will be using rows in its residual calculation that the independent variable does not use.
Centering dependent turns out not to matter. See 6bacafa for a lengthy description. (We discussed this in person, but putting the link here for posterity.)
The usual formula for covariance is::

    mean((X - mean(X)) * (Y - mean(Y)))

This is equivalent, however, to just doing::

    mean((X - mean(X)) * Y)

Proof: Let X_res = (X - mean(X)). We have:

    mean(X_res * (Y - mean(Y)))
        = mean(X_res * (Y - mean(Y)))              (1)
        = mean((X_res * Y) - (X_res * mean(Y)))    (2)
        = mean(X_res * Y) - mean(X_res * mean(Y))  (3)
        = mean(X_res * Y) - mean(X_res) * mean(Y)  (4)
        = mean(X_res * Y) - 0 * mean(Y)            (5)
        = mean(X_res * Y)

The tricky step in the above derivation is step (4). We know that mean(X_res) is zero because, for any X:

    mean(X - mean(X)) = mean(X) - mean(X) = 0
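A quick numerical spot-check of that identity (illustrative only): both forms of the covariance agree to floating-point precision on random data.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000)
Y = rng.randn(1000)

X_res = X - X.mean()
full_formula = (X_res * (Y - Y.mean())).mean()  # mean((X - mean(X)) * (Y - mean(Y)))
simplified   = (X_res * Y).mean()               # mean((X - mean(X)) * Y)
print(np.isclose(full_formula, simplified))  # True
```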
- Fix incorrect variable names and comments.
- Add test coverage for broken repr function.
The `compute` function that takes 95 seconds is `RollingLinearRegressionOfReturns`. The `compute` function that takes 0.082 seconds is the new implementation.