PERF: Speedup RollingPearson #2071

Merged: 4 commits into master from speedup-pearson, Jan 21, 2020
Conversation

@ssanderson (Contributor) commented Jan 2, 2018

Use the same techniques used in `SimpleBeta` to re-implement
`RollingPearson` (and, transitively, `RollingPearsonOfReturns`).

For a 1-year window length, this provides about a 60x speedup on my
machine:

```
pipebench/perf/pearson.stats% stats statistical.py
Mon Jan  1 20:00:31 2018    pipebench/perf/pearson.stats

         9326197 function calls (9325541 primitive calls) in 8.407 seconds

   Ordered by: cumulative time
   List reduced from 766 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       # Old implementation.
       21    0.243    0.012    7.130    0.340 statistical.py:92(compute)
       # New implementation.
       21    0.000    0.000    0.253    0.012 statistical.py:105(compute)
       21    0.170    0.008    0.253    0.012 statistical.py:685(vectorized_pearson_r)
```

The new `vectorized_pearson_r` also has the same support for missing
data that was implemented in `vectorized_beta`, but I haven't pushed that
logic up to `RollingPearson` yet. I added a fast path for the `allowed_missing=0`
case, since we can be significantly faster there, and it's likely to stay the default
for backwards-compatibility reasons.
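
(A minimal numpy sketch of the vectorized idea, for readers following along: instead of a per-column Python loop like the test suite's `naive_columnwise_pearson` below, compute every column's correlation with whole-array operations. `pearson_r_columnwise` is a hypothetical name; the actual implementation also uses numexpr and the missing-data handling mentioned above.)

```
import numpy as np

def pearson_r_columnwise(dependents, independents):
    """Pearson r of each column of `independents` (N x M) against
    `dependents` (N x 1), with no per-column Python loop.
    Illustrative sketch, not the zipline implementation."""
    ind_residual = independents - independents.mean(axis=0)
    dep_residual = dependents - dependents.mean(axis=0)
    # r = cov(x, y) / sqrt(var(x) * var(y)); the 1/(N-1) normalizers
    # cancel, so unnormalized sums of products suffice.
    covariances = (ind_residual * dep_residual).sum(axis=0)
    ind_variance = (ind_residual ** 2).sum(axis=0)
    dep_variance = (dep_residual ** 2).sum(axis=0)
    return covariances / np.sqrt(ind_variance * dep_variance)
```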

@ssanderson changed the title from "Speedup pearson" to "Speedup RollingPearson" Jan 2, 2018
@coveralls commented Jan 2, 2018

Coverage increased (+0.01%) to 87.591% when pulling fab5c0d on speedup-pearson into 5978f68 on master.

Scott Sanderson added 4 commits January 2, 2018 10:11
Use the same techniques used in `SimpleBeta` to re-implement
RollingPearson (and RollingPearsonOfReturns, etc.).

For a 1-month window length, this provides about a 30x speedup on my
machine:

```
pipebench/perf/pearson.stats% stats statistical.py
Mon Jan  1 20:00:31 2018    pipebench/perf/pearson.stats

         9326197 function calls (9325541 primitive calls) in 8.407 seconds

   Ordered by: cumulative time
   List reduced from 766 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       # Old implementation.
       21    0.243    0.012    7.130    0.340 statistical.py:92(compute)
       # New implementation.
       21    0.000    0.000    0.253    0.012 statistical.py:105(compute)
       21    0.170    0.008    0.253    0.012 statistical.py:685(vectorized_pearson_r)
```

The new `vectorized_pearson_r` also has the same support for missing
data that was implemented in `vectorized_beta`, but I haven't pushed that
logic up to `RollingPearsonR` yet.  For backwards compatibility, the
default for RollingPearson probably needs to stay at 0% allowed missing.
@coveralls commented Jan 2, 2018

Coverage increased (+0.01%) to 87.591% when pulling cb61ba9 on speedup-pearson into 5978f68 on master.

@freddiev4 (Contributor) left a comment:

This is cool! Had some questions for you below 😃


```
evaluate(
    'where(mask, nan, cov / sqrt(ind_variance * dep_variance))',
    local_dict={'cov': covariances,
```
@freddiev4 (Contributor):
Regarding style, why not put the first kv pair on the next line and then align the following kv pairs with that?

```
local_dict={
    'cov': covariances,
    'mask': isnan(independents).sum(axis=0) > allowed_missing,
    'nan': np.nan,
    'ind_variance': ind_variance,
    'dep_variance': dep_variance,
}
```

```
    :class:`zipline.pipeline.factors.RollingPearsonOfReturns`
    """
    nan = np.nan
    isnan = np.isnan
```
@freddiev4 (Contributor) commented Jan 10, 2018:
Why do we assign the values from numpy to variables instead of just writing `np.nan` and `np.isnan`? Would doing the latter affect performance at all (I can't really imagine that being the case)?

```
        'nan': np.nan,
        'ind_variance': ind_variance,
        'dep_variance': dep_variance},
    global_dict={},
```
@freddiev4 (Contributor):
What is the purpose of this `global_dict` param? And why do we want it to be an empty dict?
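
(For context on the question: `numexpr.evaluate` accepts optional `local_dict` and `global_dict` arguments, and when they are omitted it resolves expression names from the calling frame. A small illustrative sketch of the difference, assuming numexpr is installed:)

```
import numexpr
import numpy as np

x = np.arange(5.0)
scale = 2.0

# With no dicts passed, numexpr looks up 'x' and 'scale' in the
# calling frame's locals and globals.
print(numexpr.evaluate('x * scale'))

# With global_dict={}, only names passed explicitly are visible, so
# the expression can't silently capture module-level variables.
print(numexpr.evaluate('x * s', local_dict={'x': x, 's': scale},
                       global_dict={}))
```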

```
def naive_columnwise_pearson(self, left, right):
    return self.naive_columnwise_func(pearsonr, left, right)

def naive_columnwise_spearman(self, left, right):
```
@freddiev4 (Contributor):
Is this used anywhere?

```
# We should get the same result from passing an N x 1 array or an N x 3
# array with the column tiled 3 times.
do_check(_independent)
do_check(np.tile(_independent, 3))
```
@freddiev4 (Contributor):
TIL about np.tile(). That's cool 🙂
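
(For reference, `np.tile` with an integer repetition count repeats a 2-D array along its last axis; an illustrative sketch:)

```
import numpy as np

col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
tiled = np.tile(col, 3)                # shape (3, 3)
# tiled == [[1., 1., 1.],
#           [2., 2., 2.],
#           [3., 3., 3.]]
```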

```
        seed,
        nans,
        nan_offset):
    rand = np.random.RandomState(seed)
```
@freddiev4 (Contributor):
What's the advantage of using this over `np.random.seed()` and then `np.random.uniform()`? Looking at the other test cases it seems like they all use this as well.

Reply (Contributor):
`np.random.seed` sets the global random state's seed, which mutates shared global state. `np.random.RandomState(seed)` creates a new random state which is isolated from other tests or callers.
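
(A small illustration of that isolation point; a sketch, not from the PR:)

```
import numpy as np

# Seeding the global state affects every caller of np.random.*:
np.random.seed(42)
a = np.random.uniform(size=3)

# A RandomState instance carries its own isolated state:
rand = np.random.RandomState(42)
b = rand.uniform(size=3)

# Same seed, same legacy generator, so the draws match:
assert np.allclose(a, b)
```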

@freddiev4 changed the title from "Speedup RollingPearson" to "PERF: Speedup RollingPearson" Jan 10, 2018
@h55nick commented May 8, 2018

Would be great to get this locked in! I use the RollingPearson factor a bunch.

@aurtistictrader commented:
Would it be worth it to get the RollingSpearman stuff too? I find it hard to use that as well.

@ssanderson ssanderson merged commit 7eeaafb into master Jan 21, 2020
@ssanderson ssanderson deleted the speedup-pearson branch January 21, 2020 15:57