PERF: Speedup RollingPearson #2071

Merged: 4 commits into master from speedup-pearson, Jan 21, 2020
Conversation

@ssanderson (Contributor) commented Jan 2, 2018

Use the same techniques used in `SimpleBeta` to re-implement
`RollingPearson` (and, transitively, `RollingPearsonOfReturns`).

For a 1-year window length, this provides about a 60x speedup on my
machine:

```
pipebench/perf/pearson.stats% stats statistical.py
Mon Jan  1 20:00:31 2018    pipebench/perf/pearson.stats

         9326197 function calls (9325541 primitive calls) in 8.407 seconds

   Ordered by: cumulative time
   List reduced from 766 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       # Old implementation.
       21    0.243    0.012    7.130    0.340 statistical.py:92(compute)
       # New implementation.
       21    0.000    0.000    0.253    0.012 statistical.py:105(compute)
       21    0.170    0.008    0.253    0.012 statistical.py:685(vectorized_pearson_r)
```

The new `vectorized_pearson_r` also has the same support for missing
data that was implemented in `vectorized_beta`, but I haven't pushed that
logic up to `RollingPearson` yet. I added a fast path for the `allowed_missing=0`
case, since we can be significantly faster there, and it's likely to stay the default
for backwards-compatibility reasons.
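
(A minimal numpy sketch of the vectorized idea, for readers following along: instead of a per-column Python loop like the test suite's `naive_columnwise_pearson` below, compute every column's correlation with whole-array operations. `pearson_r_columnwise` is a hypothetical name; the actual implementation also uses numexpr and the missing-data handling mentioned above.)

```
import numpy as np

def pearson_r_columnwise(dependents, independents):
    """Pearson r of each column of `independents` (N x M) against
    `dependents` (N x 1), with no per-column Python loop.
    Illustrative sketch, not the zipline implementation."""
    ind_residual = independents - independents.mean(axis=0)
    dep_residual = dependents - dependents.mean(axis=0)
    # r = cov(x, y) / sqrt(var(x) * var(y)); the 1/(N-1) normalizers
    # cancel, so unnormalized sums of products suffice.
    covariances = (ind_residual * dep_residual).sum(axis=0)
    ind_variance = (ind_residual ** 2).sum(axis=0)
    dep_variance = (dep_residual ** 2).sum(axis=0)
    return covariances / np.sqrt(ind_variance * dep_variance)
```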

@ssanderson changed the title from "Speedup pearson" to "Speedup RollingPearson" Jan 2, 2018
@coveralls commented Jan 2, 2018

Coverage increased (+0.01%) to 87.591% when pulling fab5c0d on speedup-pearson into 5978f68 on master.

Scott Sanderson added 4 commits January 2, 2018 10:11
Use the same techniques used in `SimpleBeta` to re-implement
RollingPearson (and RollingPearsonOfReturns, etc.).

For a 1-month window length, this provides about a 30x speedup on my
machine:

```
pipebench/perf/pearson.stats% stats statistical.py
Mon Jan  1 20:00:31 2018    pipebench/perf/pearson.stats

         9326197 function calls (9325541 primitive calls) in 8.407 seconds

   Ordered by: cumulative time
   List reduced from 766 to 3 due to restriction <'statistical.py'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       # Old implementation.
       21    0.243    0.012    7.130    0.340 statistical.py:92(compute)
       # New implementation.
       21    0.000    0.000    0.253    0.012 statistical.py:105(compute)
       21    0.170    0.008    0.253    0.012 statistical.py:685(vectorized_pearson_r)
```

The new `vectorized_pearson_r` also has the same support for missing
data that was implemented in `vectorized_beta`, but I haven't pushed that
logic up to `RollingPearsonR` yet.  For backwards compatibility, the
default for RollingPearson probably needs to stay at 0% allowed missing.
@coveralls commented Jan 2, 2018

Coverage increased (+0.01%) to 87.591% when pulling cb61ba9 on speedup-pearson into 5978f68 on master.

@freddiev4 (Contributor) left a comment:

This is cool! Had some questions for you below 😃


```
evaluate(
    'where(mask, nan, cov / sqrt(ind_variance * dep_variance))',
    local_dict={'cov': covariances,
```
@freddiev4 (Contributor):
Regarding style, why not put the first kv pair on the next line and then align the following kv pairs with that?

```
local_dict={
    'cov': covariances,
    'mask': isnan(independents).sum(axis=0) > allowed_missing,
    'nan': np.nan,
    'ind_variance': ind_variance,
    'dep_variance': dep_variance,
}
```

```
    :class:`zipline.pipeline.factors.RollingPearsonOfReturns`
    """
    nan = np.nan
    isnan = np.isnan
```
@freddiev4 (Contributor) commented Jan 10, 2018:
Why do we assign the values from numpy to variables instead of just writing `np.nan` and `np.isnan`? Would doing the latter affect performance at all (I can't really imagine that being the case)?

```
        'nan': np.nan,
        'ind_variance': ind_variance,
        'dep_variance': dep_variance},
    global_dict={},
```
@freddiev4 (Contributor):
What is the purpose of this `global_dict` param? And why do we want it to be an empty dict?
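
(For context on the question: `numexpr.evaluate` accepts optional `local_dict` and `global_dict` arguments, and when they are omitted it resolves expression names from the calling frame. A small illustrative sketch of the difference, assuming numexpr is installed:)

```
import numexpr
import numpy as np

x = np.arange(5.0)
scale = 2.0

# With no dicts passed, numexpr looks up 'x' and 'scale' in the
# calling frame's locals and globals.
print(numexpr.evaluate('x * scale'))

# With global_dict={}, only names passed explicitly are visible, so
# the expression can't silently capture module-level variables.
print(numexpr.evaluate('x * s', local_dict={'x': x, 's': scale},
                       global_dict={}))
```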

```
def naive_columnwise_pearson(self, left, right):
    return self.naive_columnwise_func(pearsonr, left, right)

def naive_columnwise_spearman(self, left, right):
```
@freddiev4 (Contributor):
Is this used anywhere?

```
# We should get the same result from passing an N x 1 array or an N x 3
# array with the column tiled 3 times.
do_check(_independent)
do_check(np.tile(_independent, 3))
```
@freddiev4 (Contributor):
TIL about np.tile(). That's cool 🙂
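
(For reference, `np.tile` with an integer repetition count repeats a 2-D array along its last axis; an illustrative sketch:)

```
import numpy as np

col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1)
tiled = np.tile(col, 3)                # shape (3, 3)
# tiled == [[1., 1., 1.],
#           [2., 2., 2.],
#           [3., 3., 3.]]
```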

```
        seed,
        nans,
        nan_offset):
    rand = np.random.RandomState(seed)
```
@freddiev4 (Contributor):
What's the advantage of using this over `np.random.seed()` and then `np.random.uniform()`? Looking at the other test cases it seems like they all use this as well.

Reply (Contributor):
`np.random.seed` sets the global random state's seed, which mutates shared global state. `np.random.RandomState(seed)` creates a new random state which is isolated from other tests or callers.
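
(A small illustration of that isolation point; a sketch, not from the PR:)

```
import numpy as np

# Seeding the global state affects every caller of np.random.*:
np.random.seed(42)
a = np.random.uniform(size=3)

# A RandomState instance carries its own isolated state:
rand = np.random.RandomState(42)
b = rand.uniform(size=3)

# Same seed, same legacy generator, so the draws match:
assert np.allclose(a, b)
```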

@freddiev4 changed the title from "Speedup RollingPearson" to "PERF: Speedup RollingPearson" Jan 10, 2018
@h55nick commented May 8, 2018

Would be great to get this locked in! I use the RollingPearson factor a bunch.

@aurtistictrader commented:
Would it be worth it to get the RollingSpearman stuff too? I find it hard to use that as well.

@ssanderson ssanderson merged commit 7eeaafb into master Jan 21, 2020
@ssanderson ssanderson deleted the speedup-pearson branch January 21, 2020 15:57