PERF: Add a new factor that just computes beta. #2021
Conversation
I think the only significant decision that needs to be made here is what we want to do if there are NaNs in the input. The current implementation uses

cc @Jstauth @jmccorriston @ahgnaw @twiecki for thoughts on how we want to handle missing data here.

Note that I'm not trying to port our downstream shrinkage beta. (I think we should also do that, but the use-case here is just to provide the simplest and fastest reasonable implementation of beta, in particular with an eye toward use in portfolio optimization.)
This is a little over 1100x faster (!) than using `RollingLinearRegressionOfReturns` on my machine. Profiling output for a 1-month pipeline using both terms with a 90-day lookback:

```
Tue Nov 21 02:05:48 2017    pipebench/perf/betas.stats

         57724856 function calls (57689155 primitive calls) in 92.342 seconds

   Ordered by: cumulative time
   List reduced from 1212 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.612    0.029   95.461    4.546 statistical.py:194(compute)
   172201    0.407    0.000   94.843    0.001 statistical.py:201(regress)
       21    0.048    0.002    0.082    0.004 statistical.py:500(compute)
```
Force-pushed from dbe276b to 77cb3fd.
```
@@ -148,11 +155,6 @@ class RollingLinearRegression(CustomFactor, SingleInputMixin):
        The factor/slice whose columns are the predictor/independent variable
        of each regression with `dependent`. If `independent` is a Factor,
        regressions are computed asset-wise.
    independent : zipline.pipeline.Term with a numeric dtype
```
This is an unrelated change to the rest of the PR. We had this documented twice.
Hilarious performance note: in the current implementation more than half of the total time is spent constructing namedtuple types:
In

What is the current behavior of
Any asset with a NaN at any point in the lookback window ends up NaN.
Throw out observations that have nans in either array in ``SimpleBeta``.
@ssanderson Pending the TODO on missing data, this looks good to me.
```
class SimpleBeta(CustomFactor, StandardOutputs):
    """
    Factor producing the slope of a regression line between each asset's daily
    returns the daily returns of a single "target" asset.
```
"...each asset's daily returns and the daily returns of..." (missing "and")
Pushed a change that makes the behavior here match empyrical exactly around missing data. Handling NaNs imposes around a 2.5x performance penalty (from 0.9 seconds per year to 2.2 on my machine). That's still a huge win compared to 90+ seconds per month with the old implementation. We could probably claw back a good chunk of that performance by dropping into Cython for the nan-aware handling, but I think this is probably a good first cut if we're happy with empyrical's nan-handling behavior, which is to drop any observation in the regression where either the independent or dependent variable has missing data. Before:
After:
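A minimal numpy sketch of that nan-handling policy for the single-asset case (the helper name is hypothetical; this is not the PR's actual code): drop any observation where either the dependent or independent series is NaN, then regress on what remains.

```python
import numpy as np

def beta_dropping_nans(dependent, independent):
    """Slope of the OLS regression of ``dependent`` on ``independent``,
    ignoring rows where either input is NaN (the policy described above)."""
    valid = ~(np.isnan(dependent) | np.isnan(independent))
    x = independent[valid]
    y = dependent[valid]
    x_res = x - x.mean()
    # beta = Cov(X, Y) / Var(X)
    return (x_res * y).mean() / (x_res ** 2).mean()

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, np.nan, 10.0])
# Rows 2 and 3 are dropped; the surviving rows lie exactly on y = 2x,
# so the estimated beta is (approximately) 2.0.
print(beta_dropping_nans(y, x))
```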
Hrm, though, thinking about this a bit more, I think we need to require a minimum number of data points if we're handling nans in this way (as @ahgnaw suggests above). If we don't, we'll produce crazy values for regressions where we only have 2-3 data points.
Oh, that's not great. I like Ana's idea of a minimum number or minimum percentage of non-nan date pairs.
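To illustrate why a minimum-observations guard matters (a sketch with invented names, not the PR's implementation): two surviving points always fit a line exactly, so the slope is pure noise, and a minimum-observations check turns that into NaN instead.

```python
import numpy as np

def beta_with_minimum(dependent, independent, min_observations):
    """Nan-dropping beta that refuses to produce a value when too few
    valid observations remain. Illustrative only."""
    valid = ~(np.isnan(dependent) | np.isnan(independent))
    if valid.sum() < min_observations:
        return np.nan  # too few points to trust the estimate
    x = independent[valid]
    y = dependent[valid]
    x_res = x - x.mean()
    return (x_res * y).mean() / (x_res ** 2).mean()

# A 90-day window in which only two observations survived:
x = np.full(90, np.nan)
y = np.full(90, np.nan)
x[:2] = [0.01, 0.011]
y[:2] = [0.05, -0.2]

print(beta_with_minimum(y, x, min_observations=2))   # an extreme slope (~ -250)
print(beta_with_minimum(y, x, min_observations=10))  # nan: not enough data
```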
Make `SimpleBeta` produce values when there are missing returns observations. By default, we allow up to 25% of the returns observations to be missing, so that we can start producing betas earlier for recently IPO-ed stocks.
Asset type has a different repr in py3.
@ssanderson Let me know if this needs another round of review
```
def vectorized_beta(dependents, independent, allowed_missing, out=None):
```
I couldn't find another place where `vectorized_beta` might be useful at the moment, however at a later date we might want to move it to `empyrical`.
I don't think empyrical would use the vectorized part of this anywhere right now, but I suspect that the direct calculation of covariance and variance here is faster than what we currently do in empyrical, because it avoids making copies of the data and it avoids calculating parts of the covariance matrix that we don't need. I'm not super worried about porting this in the near term, because I don't think anything is currently bottlenecked on empyrical's beta calculations, but if that becomes a bottleneck this is probably a good place to look to speed things up.
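For reference, a simplified sketch of the direct covariance/variance approach being discussed, ignoring the missing-data handling (names are illustrative, not the PR's exact code): each per-column covariance is computed directly against the single independent column, so no full covariance matrix is ever formed.

```python
import numpy as np

def vectorized_beta_sketch(dependents, independent):
    """Compute one beta per column of ``dependents`` (shape (T, M))
    against a single ``independent`` column (shape (T, 1))."""
    # Center only the independent variable; centering the dependents
    # does not change the result (a standard covariance identity).
    ind_residual = independent - independent.mean(axis=0)
    # shape (M,): Cov(X, Y) for each dependent column, computed
    # directly rather than via a full covariance matrix.
    covariances = (ind_residual * dependents).mean(axis=0)
    # shape (1,): Var(X), shared by every column.
    independent_variance = (ind_residual ** 2).mean(axis=0)
    return covariances / independent_variance

t = np.linspace(0.0, 1.0, 100)
independent = t[:, None]
dependents = np.column_stack([2.0 * t, -3.0 * t + 1.0])
# The two columns are exact linear functions of the independent
# variable, so the betas come out (approximately) [2, -3].
print(vectorized_beta_sketch(dependents, independent))
```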
```
def test_allowed_missing_doesnt_double_count(self):
    # Test that allowed_missing only counts a row as missing one
    # observation if it's missing in both the dependent and independent
```
should "both" be "either"? Dates for which the dependent or independent (or both) data is nan cannot be included in the beta calculation.
I think this wording is correct. This test checks that if a row has a NaN in both the dependent and independent, then we only count that row as adding 1 toward the allowed number of missing values.
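A tiny illustration of that counting rule: combining the NaN masks with a logical OR means a row that is NaN in both inputs contributes only one missing observation, not two.

```python
import numpy as np

dependent   = np.array([1.0, np.nan, np.nan, 4.0])
independent = np.array([1.0, np.nan, 3.0,    np.nan])

# A row is unusable if either input is NaN, but a row that is NaN in
# BOTH inputs should only consume one unit of the allowed_missing
# budget.
missing = np.isnan(dependent) | np.isnan(independent)
print(int(missing.sum()))  # 3 -- row 1 is NaN in both but counted once
```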
@dmichalowicz yeah, I think this is ready for another pass if you can take another look.
@ssanderson Left some more comments. My main question is with how we "nan" out the independent variable but don't do the opposite for the dependent.
```
    window_length=2,
    mask=(AssetExists() | SingleAsset(asset=target)),
)
allowed_missing_count = int(np.floor(
```
Why do you need `np.floor`? Doesn't `int` already floor?
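For what it's worth, `int` truncates toward zero rather than flooring, so the two only differ for negative inputs; since the count here is nonnegative, they agree and the `np.floor` call is redundant in practice:

```python
import numpy as np

# int() truncates toward zero; np.floor rounds toward negative
# infinity. They agree on nonnegative values like a missing-data
# count, and differ only for negative inputs.
print(int(2.75), int(np.floor(2.75)))    # 2 2
print(int(-2.75), int(np.floor(-2.75)))  # -2 -3
```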
```
@expect_types(
    regression_length=int,
    target=Asset,
```
super nitpicky, but could you make these the same order as the function definition?
```
return "{}(window_length={}, allowed_missing={})".format(
    type(self).__name__,
    self.window_length,
    self.params['allowed_missing'],
```
Should this be "allowed_missing_count"?
```
    independent,
)

# Calculate beta as Cov(X, Y) / Cov(Y, Y).
```
I think you mean Cov(X, Y) / Cov(X, X) (The calculation below is correct, just this comment is off)
```
# shape: (M,)
independent_variances = nanmean(ind_residual ** 2, axis=0)

# shape: (M,)
```
Thanks for these shape comments!
tests/pipeline/test_statistical.py
```
# Sanity check that we actually inserted some nans.
self.assertTrue(np.count_nonzero(np.isnan(dependents)) > 0)
self.assertTrue(np.count_nonzero(np.isnan(independent)) > 0)
```
It's a very small chance, but doesn't this mean this test could fail randomly?
We're seeding the rng, so this is still deterministic.
tests/pipeline/test_statistical.py
```
assert_equal(np.isnan(result5),
             np.array([False, False, True, False, False]))

# With six allowed missing values, everything should produce a value.
```
With five allowed...
```
    isnan(dependents),
    nan,
    independent,
)
```
Should you do a similar operation on the `dependent` matrix? Otherwise it will be using rows in its residual calculation that the independent variable does not use.
Centering dependent turns out not to matter. See 6bacafa for a lengthy description. (We discussed this in person, but putting the link here for posterity.)
The usual formula for covariance is::

    mean((X - mean(X)) * (Y - mean(Y)))

This is equivalent, however, to just doing::

    mean((X - mean(X)) * Y)

Proof: Let X_res = (X - mean(X)). We have:

    mean(X_res * (Y - mean(Y)))
        = mean(X_res * (Y - mean(Y)))              (1)
        = mean((X_res * Y) - (X_res * mean(Y)))    (2)
        = mean(X_res * Y) - mean(X_res * mean(Y))  (3)
        = mean(X_res * Y) - mean(X_res) * mean(Y)  (4)
        = mean(X_res * Y) - 0 * mean(Y)            (5)
        = mean(X_res * Y)

The tricky step in the above derivation is step (4). We know that mean(X_res) is zero because, for any X:

    mean(X - mean(X)) = mean(X) - mean(X) = 0
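A quick numerical spot-check of that identity (illustrative only): both forms of the covariance agree to floating-point precision on random data.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000)
Y = rng.randn(1000)

X_res = X - X.mean()
full_formula = (X_res * (Y - Y.mean())).mean()  # mean((X - mean(X)) * (Y - mean(Y)))
simplified   = (X_res * Y).mean()               # mean((X - mean(X)) * Y)
print(np.isclose(full_formula, simplified))  # True
```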
- Fix incorrect variable names and comments.
- Add test coverage for broken repr function.
The `compute` function that takes 95 seconds is `RollingLinearRegressionOfReturns`. The `compute` function that takes 0.082 seconds is the new implementation.