
PERF: Add a new factor that just computes beta. #2021

Merged
merged 10 commits from faster-beta into master on Nov 27, 2017

Conversation

@ssanderson (Contributor) commented Nov 21, 2017

This is a little over 1100x faster (!) than using
RollingLinearRegressionOfReturns on my machine.

Profiling output for a 1-month pipeline using both terms with a
90-day lookback:

```
Tue Nov 21 02:05:48 2017    pipebench/perf/betas.stats

         57724856 function calls (57689155 primitive calls) in 92.342 seconds

   Ordered by: cumulative time
   List reduced from 1212 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       21    0.612    0.029   95.461    4.546 statistical.py:194(compute)
   172201    0.407    0.000   94.843    0.001 statistical.py:201(regress)
       21    0.048    0.002    0.082    0.004 statistical.py:500(compute)
```
  • The compute function that takes 95 seconds is RollingLinearRegressionOfReturns.
  • The compute function that takes 0.082 seconds is the new implementation.

@ssanderson (Contributor Author)

I think the only significant decision that needs to be made here is what we want to do if there are NaNs in the input. The current implementation uses nanmean everywhere possible, which means it produces values more often than the existing RollingLinearRegressionOfReturns (which internally calls scipy.stats.linregress, which turns out not to be particularly efficient when called 172,000 times).

cc @Jstauth @jmccorriston @ahgnaw @twiecki for thoughts on how we want to handle missing data here. Note that I'm not trying to port our downstream shrinkage beta (I think we should also do that, but the use-case here is to just provide the simplest and fastest reasonable implementation of beta, in particular with an eye toward use in portfolio optimization.)
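A minimal sketch of the difference under discussion (the function names here are illustrative, not the PR's code): with plain means, a single NaN poisons the whole regression, while nanmean-based means skip missing observations pointwise.

```python
import numpy as np

def beta_plain(y, x):
    # Cov(x, y) / Var(x) with plain means: a single NaN poisons the result.
    x_res = x - x.mean()
    return np.mean(x_res * y) / np.mean(x_res ** 2)

def beta_nan(y, x):
    # Same formula with nanmean: missing observations are skipped pointwise.
    x_res = x - np.nanmean(x)
    return np.nanmean(x_res * y) / np.nanmean(x_res ** 2)

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

print(beta_plain(y, x))  # nan
print(beta_nan(y, x))    # 2.0
```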

@@ -148,11 +155,6 @@ class RollingLinearRegression(CustomFactor, SingleInputMixin):
The factor/slice whose columns are the predictor/independent variable
of each regression with `dependent`. If `independent` is a Factor,
regressions are computed asset-wise.
independent : zipline.pipeline.Term with a numeric dtype
@ssanderson (Contributor Author):

This is an unrelated change to the rest of the PR. We had this documented twice.

@ssanderson (Contributor Author)

Hilarious performance note: in the current implementation more than half of the total time is spent constructing namedtuple types:

```
Function                                called...
                                            ncalls  tottime  cumtime
_stats_mstats_common.py:11(linregress)  ->  172201    3.918   15.788  _distn_infrastructure.py:1720(sf)
                                            172201   44.965   58.895  collections.py:305(namedtuple)
                                            344402    0.445    5.133  fromnumeric.py:2796(mean)
                                            172201    2.672   10.022  function_base.py:2298(cov)
                                            344402    0.209    0.272  numeric.py:414(asarray)
                                            172201    0.028    0.028  {len}
```
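The cost being profiled can be illustrated with a hedged sketch (this mimics the pattern, not scipy's actual code): collections.namedtuple() builds a brand-new class on every call, which is far more expensive than instantiating a class that was built once up front.

```python
import timeit
from collections import namedtuple

def make_type_each_call():
    # Mimics building the result type inside the hot loop.
    Result = namedtuple('Result', ['slope', 'intercept', 'rvalue'])
    return Result(1.0, 0.0, 1.0)

# The cheap alternative: build the class once at module scope.
Result = namedtuple('Result', ['slope', 'intercept', 'rvalue'])

def reuse_type():
    return Result(1.0, 0.0, 1.0)

t_create = timeit.timeit(make_type_each_call, number=2000)
t_reuse = timeit.timeit(reuse_type, number=2000)
print(t_create, t_reuse)  # type creation dominates instantiation
```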

@coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+0.01%) to 87.353% when pulling 77cb3fd on faster-beta into 079ea3d on master.


@analicia (Contributor)

In empyrical.beta_aligned() we remove any date-pairs with np.nan in them. Here we might want to add a minimum number of non-nan date pairs required to calculate beta.
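To make the suggestion concrete, here is a hedged sketch of pair-dropping plus a minimum-observations guard (the function name and min_obs parameter are illustrative, not empyrical's API):

```python
import numpy as np

def beta_dropping_nan_pairs(dependent, independent, min_obs):
    # Keep only dates where both series have data.
    joint = ~(np.isnan(dependent) | np.isnan(independent))
    if joint.sum() < min_obs:
        # Too few jointly-observed dates to trust the regression.
        return np.nan
    x = independent[joint]
    y = dependent[joint]
    x_res = x - x.mean()
    return np.mean(x_res * y) / np.mean(x_res ** 2)

x = np.arange(20.0)
y = 3.0 * x + 1.0
y[::5] = np.nan  # knock out 4 of the 20 observations

print(beta_dropping_nan_pairs(y, x, min_obs=10))  # ~3.0
print(beta_dropping_nan_pairs(y, x, min_obs=17))  # nan
```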

@jmccorriston (Contributor)

What is the current behavior of RollingLinearRegressionOfReturns with regards to nan values?

@ssanderson (Contributor Author)

> What is the current behavior of RollingLinearRegressionOfReturns with regards to nan values?

Any asset with a NaN at any point in the lookback window ends up NaN.

Throw out observations that have nans in either array in
``SimpleBeta``.
@dmichalowicz (Contributor) left a comment

@ssanderson Pending the TODO on missing data, this looks good to me.

class SimpleBeta(CustomFactor, StandardOutputs):
"""
Factor producing the slope of a regression line between each asset's daily
returns the daily returns of a single "target" asset.
@dmichalowicz (Contributor):

"...each asset's daily returns and the daily returns of..." (missing "and")

@ssanderson (Contributor Author) commented Nov 21, 2017

Pushed a change that makes the behavior here match empyrical exactly around missing data. Handling NaNs imposes around a 2.5x performance penalty (from 0.9 seconds per year to 2.2 on my machine), but that's still a huge win compared to 90+ seconds per month with the old implementation. We could probably claw back a good chunk of that performance by dropping into Cython for the nan-aware handling, but I think this is a good first cut if we're happy with empyrical's nan-handling behavior, which is to drop any observation from the regression where either the independent or dependent variable has missing data.

Before:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    3.544    3.544 engine.py:445(compute_chunk)
        2    0.027    0.013    1.469    0.735 mixins.py:189(_compute)
      253    0.518    0.002    0.917    0.004 statistical.py:512(compute)
```

After:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    4.909    4.909 engine.py:445(compute_chunk)
        2    0.028    0.014    2.844    1.422 mixins.py:189(_compute)
      253    0.003    0.000    2.252    0.009 statistical.py:515(compute)
```

@ssanderson (Contributor Author) commented Nov 21, 2017

Hrm, though, thinking about this a bit more, I think we need to require a minimum number of data points if we're handling nans in this way (as @ahgnaw suggests above). If we don't, we'll produce crazy values for regressions where we only have 2-3 data points.

@coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+0.02%) to 87.356% when pulling 0c66d72 on faster-beta into 079ea3d on master.

@jmccorriston (Contributor)

Oh, that's not great. I like Ana's idea of a minimum number or minimum percentage of non-nan date pairs.

Make `SimpleBeta` produce values when there are missing returns
observations.

By default, we allow up to 25% of the returns observations to be
missing, so that we can start producing betas earlier for recently
IPO-ed stocks.
@coveralls

Coverage Status

Coverage decreased (-0.05%) to 87.296% when pulling 8c8df85 on faster-beta into 079ea3d on master.

Asset type has a different repr in py3.
@coveralls commented Nov 22, 2017

Coverage Status

Coverage increased (+0.02%) to 87.36% when pulling 71789ff on faster-beta into 079ea3d on master.

@dmichalowicz (Contributor)

@ssanderson Let me know if this needs another round of review

)


def vectorized_beta(dependents, independent, allowed_missing, out=None):
@dmichalowicz (Contributor):

I couldn't find another place where vectorized_beta might be useful at the moment; at a later date, however, we might want to move it to empyrical.

@ssanderson (Contributor Author):

I don't think empyrical would use the vectorized part of this anywhere right now, but I suspect that the direct calculation of covariance and variance here is faster than what we currently do in empyrical, because it avoids making copies of the data and it avoids calculating parts of the covariance matrix that we don't need. I'm not super worried about porting this in the near term, because I don't think anything is currently bottlenecked on empyrical's beta calculations, but if that becomes a bottleneck this is probably a good place to look to speed things up.
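For reference, the direct calculation being described can be sketched as follows. This is a simplified stand-in, not the PR's exact vectorized_beta: it omits the allowed_missing bookkeeping, but shows the shape of the trick (mask the independent column per dependent, center only the independent, and take nanmeans of the products instead of building a full covariance matrix).

```python
import numpy as np

def vectorized_beta_sketch(dependents, independent):
    """dependents: (T, M) returns matrix; independent: (T,) target returns.

    Returns an (M,) array of regression slopes, dropping NaN rows pairwise.
    """
    # Broadcast the target column against all M dependent columns, and NaN
    # out the independent wherever the dependent is missing so each column's
    # mean and variance use only its jointly-observed rows.
    ind = np.where(np.isnan(dependents), np.nan, independent[:, None])
    # Center the independent variable per column; centering the dependent
    # turns out to be unnecessary (see the covariance identity below).
    ind_residual = ind - np.nanmean(ind, axis=0)                 # shape: (T, M)
    covariances = np.nanmean(ind_residual * dependents, axis=0)  # shape: (M,)
    variances = np.nanmean(ind_residual ** 2, axis=0)            # shape: (M,)
    return covariances / variances

rng = np.random.RandomState(0)
x = rng.randn(100)
betas = np.array([1.0, 2.0, 3.0])
Y = x[:, None] * betas + 5.0
Y[3, 0] = np.nan   # a missing dependent observation
x[7] = np.nan      # a missing independent observation

print(vectorized_beta_sketch(Y, x))  # ~[1. 2. 3.]
```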


def test_allowed_missing_doesnt_double_count(self):
# Test that allowed_missing only counts a row as missing one
# observation if it's missing in both the dependent and independent
@dmichalowicz (Contributor):

should "both" be "either"? Dates for which the dependent or independent (or both) data is nan cannot be included in the beta calculation.

@ssanderson (Contributor Author):

I think this wording is correct. This test is checking that if a row has a NaN in both the dependent and independent arrays, then we only count that row once toward the number of allowed missing values.

@ssanderson (Contributor Author)

@dmichalowicz yeah, I think this is ready for another pass if you can take another look.

@coveralls

Coverage Status

Coverage increased (+0.02%) to 87.36% when pulling 92275ad on faster-beta into 079ea3d on master.


@dmichalowicz (Contributor) left a comment

@ssanderson Left some more comments. My main question is about how we "nan" out the independent variable but don't do the opposite for the dependent.

window_length=2,
mask=(AssetExists() | SingleAsset(asset=target)),
)
allowed_missing_count = int(np.floor(
@dmichalowicz (Contributor):

Why do you need np.floor? Doesn't int already floor?
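For what it's worth, int() truncates toward zero while np.floor rounds toward negative infinity, so the two only differ for negative inputs, which can't occur for a missing-observation count:

```python
import numpy as np

# Non-negative inputs: int() and np.floor() agree.
print(int(2.9), int(np.floor(2.9)))    # 2 2
# Negative inputs: int() truncates toward zero, floor goes down.
print(int(-2.9), int(np.floor(-2.9)))  # -2 -3
```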


@expect_types(
regression_length=int,
target=Asset,
@dmichalowicz (Contributor):

super nitpicky, but could you make these the same order as the function definition?

return "{}(window_length={}, allowed_missing={})".format(
type(self).__name__,
self.window_length,
self.params['allowed_missing'],
@dmichalowicz (Contributor):

Should this be "allowed_missing_count"?

independent,
)

# Calculate beta as Cov(X, Y) / Cov(Y, Y).
@dmichalowicz (Contributor):

I think you mean Cov(X, Y) / Cov(X, X) (The calculation below is correct, just this comment is off)

# shape: (M,)
independent_variances = nanmean(ind_residual ** 2, axis=0)

# shape: (M,)
@dmichalowicz (Contributor):

Thanks for these shape comments!


# Sanity check that we actually inserted some nans.
self.assertTrue(np.count_nonzero(np.isnan(dependents)) > 0)
self.assertTrue(np.count_nonzero(np.isnan(independent)) > 0)
@dmichalowicz (Contributor):

It's a very small chance, but doesn't this mean this test could fail randomly?

@ssanderson (Contributor Author):

We're seeding the rng, so this is still deterministic.

assert_equal(np.isnan(result5),
np.array([False, False, True, False, False]))

# With six allowed missing values, everything should produce a value.
@dmichalowicz (Contributor):

With five allowed...

isnan(dependents),
nan,
independent,
)
@dmichalowicz (Contributor):

Should you do a similar operation on the dependent matrix? Otherwise it will be using rows in its residual calculation that the independent variable does not use.

@ssanderson (Contributor Author):

Centering dependent turns out not to matter. See 6bacafa for a lengthy description. (We discussed this in person, but putting the link here for posterity.)

The usual formula for covariance is::

   mean((X - mean(X)) * (Y - mean(Y)))

This is equivalent, however, to just doing:

   mean((X - mean(X)) * Y)

Proof:

    Let X_res = (X - mean(X)).
    We have:

    mean(X_res * (Y - mean(Y))) = mean(X_res * (Y - mean(Y)))
                            (1) = mean((X_res * Y) - (X_res * mean(Y)))
                            (2) = mean(X_res * Y) - mean(X_res * mean(Y))
                            (3) = mean(X_res * Y) - mean(X_res) * mean(Y)
                            (4) = mean(X_res * Y) - 0 * mean(Y)
                            (5) = mean(X_res * Y)

The tricky step in the above derivation is step (4). We know that
mean(X_res) is zero because, for any X:

    mean(X - mean(X)) = mean(X) - mean(X) = 0.
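The identity in the commit message above can be spot-checked numerically (a quick sketch, not code from the PR):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(1000)
Y = rng.randn(1000)

X_res = X - X.mean()
full = np.mean(X_res * (Y - Y.mean()))  # usual covariance formula
short = np.mean(X_res * Y)              # same formula without centering Y

print(np.isclose(full, short))  # True
```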
@coveralls commented Nov 27, 2017

Coverage Status

Coverage increased (+0.03%) to 87.371% when pulling 4735f3b on faster-beta into 079ea3d on master.


@ssanderson ssanderson merged commit a4539f4 into master Nov 27, 2017
@ssanderson ssanderson deleted the faster-beta branch November 27, 2017 22:25