
PERF: Add a new factor that just computes beta. #2021

Merged
merged 10 commits into master from faster-beta on Nov 27, 2017

Conversation

@ssanderson
Member

ssanderson commented Nov 21, 2017

This is a little over 1100x faster (!) than using
RollingLinearRegressionOfReturns on my machine.

Profiling output for a 1-month pipeline using both terms with a
90-day lookback:

Tue Nov 21 02:05:48 2017    pipebench/perf/betas.stats

         57724856 function calls (57689155 primitive calls) in 92.342 seconds

   Ordered by: cumulative time
   List reduced from 1212 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.612    0.029   95.461    4.546 statistical.py:194(compute)
   172201    0.407    0.000   94.843    0.001 statistical.py:201(regress)
       21    0.048    0.002    0.082    0.004 statistical.py:500(compute)
  • The compute function that takes 95 seconds is RollingLinearRegressionOfReturns.
  • The compute function that takes 0.082 seconds is the new implementation.

@ssanderson ssanderson requested a review from dmichalowicz Nov 21, 2017

@ssanderson

Member

ssanderson commented Nov 21, 2017

I think the only significant decision that needs to be made here is what we want to do if there are NaNs in the input. The current implementation uses nanmean everywhere possible, which means it produces values more often than the existing RollingLinearRegressionOfReturns (which internally calls scipy.stats.linregress, which turns out not to be particularly efficient when called 172,000 times).

cc @Jstauth @jmccorriston @ahgnaw @twiecki for thoughts on how we want to handle missing data here. Note that I'm not trying to port our downstream shrinkage beta (I think we should also do that, but the use-case here is just to provide the simplest and fastest reasonable implementation of beta, in particular with an eye toward use in portfolio optimization).

PERF: Add a new factor that just computes beta.
This is a little over 1100x faster (!) than using
`RollingLinearRegressionOfReturns` on my machine.

Profiling output for a 1-month pipeline using both terms with a
90-day lookback:

```
Tue Nov 21 02:05:48 2017    pipebench/perf/betas.stats

         57724856 function calls (57689155 primitive calls) in 92.342 seconds

   Ordered by: cumulative time
   List reduced from 1212 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.612    0.029   95.461    4.546 statistical.py:194(compute)
   172201    0.407    0.000   94.843    0.001 statistical.py:201(regress)
       21    0.048    0.002    0.082    0.004 statistical.py:500(compute)
```

@ssanderson ssanderson force-pushed the faster-beta branch from dbe276b to 77cb3fd Nov 21, 2017

@@ -148,11 +155,6 @@ class RollingLinearRegression(CustomFactor, SingleInputMixin):
The factor/slice whose columns are the predictor/independent variable
of each regression with `dependent`. If `independent` is a Factor,
regressions are computed asset-wise.
independent : zipline.pipeline.Term with a numeric dtype

@ssanderson

ssanderson Nov 21, 2017

Member

This change is unrelated to the rest of the PR; we had this documented twice.

@ssanderson

Member

ssanderson commented Nov 21, 2017

Hilarious performance note: in the current implementation, more than half of the total time is spent constructing namedtuple types:

Function                                called...
                                            ncalls  tottime  cumtime
_stats_mstats_common.py:11(linregress)  ->  172201    3.918   15.788  _distn_infrastructure.py:1720(sf)
                                            172201   44.965   58.895  collections.py:305(namedtuple)
                                            344402    0.445    5.133  fromnumeric.py:2796(mean)
                                            172201    2.672   10.022  function_base.py:2298(cov)
                                            344402    0.209    0.272  numeric.py:414(asarray)
                                            172201    0.028    0.028  {len}
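
For readers wondering why that single line is so expensive: collections.namedtuple builds a brand-new class on every call, so invoking it once per regression dominates the runtime. A minimal sketch of the effect (not scipy's actual code; the function names here are made up for illustration):

```python
from collections import namedtuple
from timeit import timeit

def result_per_call(slope, intercept):
    # Builds a new namedtuple *type* on every invocation (expensive).
    LinregressResult = namedtuple('LinregressResult', ['slope', 'intercept'])
    return LinregressResult(slope, intercept)

# Hoisting the type construction out of the hot path makes each call cheap.
LinregressResult = namedtuple('LinregressResult', ['slope', 'intercept'])

def result_reused(slope, intercept):
    return LinregressResult(slope, intercept)

print(timeit(lambda: result_per_call(1.0, 0.0), number=10000))
print(timeit(lambda: result_reused(1.0, 0.0), number=10000))
```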
@coveralls

coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+0.01%) to 87.353% when pulling 77cb3fd on faster-beta into 079ea3d on master.

@ahgnaw

Contributor

ahgnaw commented Nov 21, 2017

In empyrical.beta_aligned() we remove any date-pairs with np.nan in them. Here we might want to add a minimum number of non-nan date pairs required to calculate beta.

@jmccorriston

Contributor

jmccorriston commented Nov 21, 2017

What is the current behavior of RollingLinearRegressionOfReturns with regards to nan values?

@ssanderson

Member

ssanderson commented Nov 21, 2017

What is the current behavior of RollingLinearRegressionOfReturns with regards to nan values?

Any asset with a NaN at any point in the lookback window ends up NaN.

ENH: Handle missing data in SimpleBeta.
Throw out observations that have nans in either array in
``SimpleBeta``.
@dmichalowicz

dmichalowicz approved these changes Nov 21, 2017

@ssanderson Pending the TODO on missing data, this looks good to me.

class SimpleBeta(CustomFactor, StandardOutputs):
"""
Factor producing the slope of a regression line between each asset's daily
returns the daily returns of a single "target" asset.

@dmichalowicz

dmichalowicz Nov 21, 2017

Contributor

"...each asset's daily returns and the daily returns of..." (missing "and")

@ssanderson

Member

ssanderson commented Nov 21, 2017

Pushed a change that makes the behavior here match empyrical exactly around missing data. Handling NaNs imposes around a 2.5x performance penalty (from 0.9 seconds per year to 2.2 on my machine). That's still a huge win compared to 90+ seconds per month with the old implementation. We could probably claw back a good chunk of that performance by dropping into Cython for the nan-aware handling, but I think this is probably a good first cut if we're happy with empyrical's nan-handling behavior, which is to drop any observation in the regression where either the independent or dependent variable has missing data.

Before:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    3.544    3.544 engine.py:445(compute_chunk)
        2    0.027    0.013    1.469    0.735 mixins.py:189(_compute)
      253    0.518    0.002    0.917    0.004 statistical.py:512(compute)

After:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    4.909    4.909 engine.py:445(compute_chunk)
        2    0.028    0.014    2.844    1.422 mixins.py:189(_compute)
      253    0.003    0.000    2.252    0.009 statistical.py:515(compute)
@ssanderson

Member

ssanderson commented Nov 21, 2017

Hrm, though, thinking about this a bit more, I think we need to require a minimum number of data points if we're handling nans in this way (as @ahgnaw suggests above). If we don't, we'll produce crazy values for regressions where we only have 2-3 data points.

@coveralls

coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+0.02%) to 87.356% when pulling 0c66d72 on faster-beta into 079ea3d on master.

@jmccorriston

Contributor

jmccorriston commented Nov 21, 2017

Oh, that's not great. I like Ana's idea of a minimum number or minimum percentage of non-nan date pairs.

ENH: Add missing value tolerance for SimpleBeta.
Make `SimpleBeta` produce values when there are missing returns
observations.

By default, we allow up to 25% of the returns observations to be
missing, so that we can start producing betas earlier for recently
IPO-ed stocks.
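
For context, here is a hedged numpy sketch of the behavior described above: mirror the dependents' missing observations into the independent column, compute column-wise statistics with nanmean, and only emit a beta when no more than allowed_missing observations were dropped. The function body is illustrative, not the exact code added in this PR:

```python
import numpy as np
from numpy import isnan, nan, nanmean, where

def vectorized_beta_sketch(dependents, independent, allowed_missing):
    """Illustrative nan-aware beta: dependents is (N, M), independent is (N, 1)."""
    # Mirror the dependents' NaNs into the independent column so each asset's
    # regression only uses dates where both series are observed.
    ind = where(isnan(dependents), nan, independent)

    ind_residual = ind - nanmean(ind, axis=0)                    # shape: (N, M)
    covariances = nanmean(ind_residual * dependents, axis=0)     # shape: (M,)
    independent_variances = nanmean(ind_residual ** 2, axis=0)   # shape: (M,)
    beta = covariances / independent_variances                   # shape: (M,)

    # A date missing in either (or both) series counts once toward the budget.
    missing_counts = isnan(ind).sum(axis=0)
    beta[missing_counts > allowed_missing] = nan
    return beta
```

For example, with a one-year window of 252 observations and the default 25% tolerance, allowed_missing would be int(0.25 * 252) = 63.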
@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage decreased (-0.05%) to 87.296% when pulling 8c8df85 on faster-beta into 079ea3d on master.

TEST: Fix test failure in python 3.
Asset type has a different repr in py3.
@coveralls

coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.02%) to 87.36% when pulling 71789ff on faster-beta into 079ea3d on master.

@dmichalowicz

Contributor

dmichalowicz commented Nov 27, 2017

@ssanderson Let me know if this needs another round of review

)
def vectorized_beta(dependents, independent, allowed_missing, out=None):

@ahgnaw

ahgnaw Nov 27, 2017

Contributor

I couldn't find another place where vectorized_beta might be useful at the moment; however, at a later date we might want to move it to empyrical.

@ssanderson

ssanderson Nov 27, 2017

Member

I don't think empyrical would use the vectorized part of this anywhere right now, but I suspect that the direct calculation of covariance and variance here is faster than what we currently do in empyrical, because it avoids making copies of the data and it avoids calculating parts of the covariance matrix that we don't need. I'm not super worried about porting this in the near term, because I don't think anything is currently bottlenecked on empyrical's beta calculations, but if that becomes a bottleneck this is probably a good place to look to speed things up.
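
To make the "avoids copies and extra work" point concrete, a rough comparison on made-up data (illustrative only, not empyrical's code):

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(252)                    # independent (target) returns
y = 1.5 * x + 0.1 * rng.randn(252)    # dependent returns

# Generic route: build the full 2x2 covariance matrix, then take a ratio.
cov_matrix = np.cov(x, y)             # computes entries beta never uses
beta_via_cov = cov_matrix[0, 1] / cov_matrix[0, 0]

# Direct route: Cov(X, Y) / Var(X) with no intermediate matrix.
x_res = x - x.mean()
beta_direct = (x_res * y).mean() / (x_res ** 2).mean()

# The 1/n vs 1/(n-1) normalization cancels in the ratio, so these agree.
print(beta_via_cov, beta_direct)
```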

def test_allowed_missing_doesnt_double_count(self):
# Test that allowed_missing only counts a row as missing one
# observation if it's missing in both the dependent and independent

@ahgnaw

ahgnaw Nov 27, 2017

Contributor

should "both" be "either"? Dates for which the dependent or independent (or both) data is nan cannot be included in the beta calculation.

@ssanderson

ssanderson Nov 27, 2017

Member

I think this wording is correct. This test checks that if a row has a NaN in both the dependent and independent, we only count that row once toward the allowed number of missing values.
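
A tiny illustration of that counting rule, on hypothetical data:

```python
import numpy as np

dependent   = np.array([0.01, np.nan, 0.02,   np.nan])
independent = np.array([0.02, np.nan, np.nan, 0.01])

# A date is unusable if either series (or both) is NaN; each such date
# counts once toward allowed_missing.
missing = np.isnan(dependent) | np.isnan(independent)
print(missing.sum())  # 3 -- the date missing in both is not double-counted
```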

@ssanderson

Member

ssanderson commented Nov 27, 2017

@dmichalowicz yeah, I think this is ready for another pass if you can take another look.

@coveralls

coveralls commented Nov 27, 2017

Coverage Status

Coverage increased (+0.02%) to 87.36% when pulling 92275ad on faster-beta into 079ea3d on master.

@dmichalowicz

@ssanderson Left some more comments. My main question is about how we "nan" out the independent variable but don't do the opposite for the dependent.

window_length=2,
mask=(AssetExists() | SingleAsset(asset=target)),
)
allowed_missing_count = int(np.floor(

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

Why do you need np.floor? Doesn't int already floor?
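
For reference, int() and np.floor() agree for the non-negative values involved here; they only differ for negative inputs, where int() truncates toward zero:

```python
import numpy as np

print(int(2.75), np.floor(2.75))      # 2 2.0
print(int(-2.75), np.floor(-2.75))    # -2 -3.0
```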

@expect_types(
regression_length=int,
target=Asset,

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

super nitpicky, but could you make these the same order as the function definition?

return "{}(window_length={}, allowed_missing={})".format(
type(self).__name__,
self.window_length,
self.params['allowed_missing'],

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

Should this be "allowed_missing_count"?

independent,
)
# Calculate beta as Cov(X, Y) / Cov(Y, Y).

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

I think you mean Cov(X, Y) / Cov(X, X) (The calculation below is correct, just this comment is off)

# shape: (M,)
independent_variances = nanmean(ind_residual ** 2, axis=0)
# shape: (M,)

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

Thanks for these shape comments!

# Sanity check that we actually inserted some nans.
self.assertTrue(np.count_nonzero(np.isnan(dependents)) > 0)
self.assertTrue(np.count_nonzero(np.isnan(independent)) > 0)

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

It's a very small chance, but doesn't this mean this test could fail randomly?

@ssanderson

ssanderson Nov 27, 2017

Member

We're seeding the rng, so this is still deterministic.

assert_equal(np.isnan(result5),
np.array([False, False, True, False, False]))
# With six allowed missing values, everything should produce a value.

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

With five allowed...

isnan(dependents),
nan,
independent,
)

@dmichalowicz

dmichalowicz Nov 27, 2017

Contributor

Should you do a similar operation on the dependent matrix? Otherwise it will be using rows in its residual calculation that the independent variable does not use.

@ssanderson

ssanderson Nov 27, 2017

Member

Centering dependent turns out not to matter. See 6bacafa for a lengthy description. (We discussed this in person, but putting the link here for posterity.)

PERF: Don't center dependents in vectorized_beta.
The usual formula for covariance is::

   mean((X - mean(X)) * (Y - mean(Y)))

This is equivalent, however, to just doing:

   mean((X - mean(X)) * Y)

Proof:

    Let X_res = (X - mean(X)).
    We have:

    mean(X_res * (Y - mean(Y))) = mean(X_res * (Y - mean(Y)))
                            (1) = mean((X_res * Y) - (X_res * mean(Y)))
                            (2) = mean(X_res * Y) - mean(X_res * mean(Y))
                            (3) = mean(X_res * Y) - mean(X_res) * mean(Y)
                            (4) = mean(X_res * Y) - 0 * mean(Y)
                            (5) = mean(X_res * Y)

The tricky step in the above derivation is step (4). We know that
mean(X_res) is zero because, for any X:

    mean(X - mean(X)) = mean(X) - mean(X) = 0.
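
A quick numerical spot-check of the identity (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000)
Y = rng.randn(1000)

X_res = X - X.mean()
lhs = (X_res * (Y - Y.mean())).mean()
rhs = (X_res * Y).mean()
print(np.isclose(lhs, rhs))  # True
```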

ssanderson added some commits Nov 27, 2017

TEST: Test cleanup for SimpleBeta.
- Fix incorrect variable names and comments.
- Add test coverage for broken repr function.
@coveralls

coveralls commented Nov 27, 2017

Coverage Status

Coverage increased (+0.03%) to 87.371% when pulling 4735f3b on faster-beta into 079ea3d on master.

@ssanderson ssanderson merged commit a4539f4 into master Nov 27, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@ssanderson ssanderson deleted the faster-beta branch Nov 27, 2017
