New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass a mask (filter) to custom factors #1095

Merged
merged 4 commits into from Apr 8, 2016

Conversation

Projects
None yet
3 participants
@dmichalowicz
Contributor

dmichalowicz commented Mar 30, 2016

When a custom factor is passed a mask, it computes only over the assets for which the mask returns True. This allows computationally expensive factors to not have to compute over an entire universe of stocks.

@coveralls

This comment has been minimized.

coveralls commented Mar 30, 2016

Coverage Status

Coverage increased (+0.007%) to 88.868% when pulling 80cb51e on factor-mask into 488d638 on master.

@coveralls

This comment has been minimized.

coveralls commented Mar 30, 2016

Coverage Status

Coverage decreased (-0.04%) to 88.823% when pulling 80cb51e on factor-mask into 488d638 on master.

@coveralls

This comment has been minimized.

coveralls commented Mar 30, 2016

Coverage Status

Coverage increased (+0.003%) to 88.864% when pulling f750557 on factor-mask into 488d638 on master.

@dmichalowicz dmichalowicz force-pushed the factor-mask branch 2 times, most recently from 7a2bdea to 9934a76 Apr 4, 2016

@dmichalowicz dmichalowicz changed the title from WIP: Pass a mask (filter) to custom factors to Pass a mask (filter) to custom factors Apr 4, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 4, 2016

Coverage Status

Coverage increased (+0.004%) to 88.866% when pulling 9934a76 on factor-mask into 2fd0f2a on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 4, 2016

Coverage Status

Coverage increased (+0.004%) to 88.866% when pulling 9934a76 on factor-mask into 2fd0f2a on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 4, 2016

Coverage Status

Coverage increased (+0.004%) to 88.866% when pulling 9934a76 on factor-mask into 2fd0f2a on master.

@dmichalowicz dmichalowicz force-pushed the factor-mask branch from 09bfdb3 to d04d12e Apr 6, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 6, 2016

Coverage Status

Coverage increased (+0.01%) to 84.276% when pulling d04d12e on factor-mask into 4cc6048 on master.

@dmichalowicz

This comment has been minimized.

Contributor

dmichalowicz commented Apr 6, 2016

@ssanderson Ready for a look

# OpenPrice a mask, any assets that are filtered out should have all
# NaN values. Otherwise, we expect its computed values to be the
# asset's open price.
values = array([constants[open]] * num_dates, dtype=float)

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

you want to explicitly use float64 here, otherwise you'll get float32 on 32-bit systems.

# NaN values. Otherwise, we expect its computed values to be the
# asset's open price.
values = array([constants[open]] * num_dates, dtype=float)
missing_values = array([nan] * num_dates)

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

np.full(nan, num_dates)

values = array([constants[open]] * num_dates, dtype=float)
missing_values = array([nan] * num_dates)

for asset_id in asset_ids:

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

rather than run separate pipelines for each asset, could we just build all the terms we want and run one pipeline here?

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 7, 2016

Contributor

I've simplified the cases a bit to now only call run_pipeline twice; once with a single masked factor and once with two masked factors

factor1 = OpenPrice(mask=mask)

# Test running our pipeline both with and without a second factor.
# We do not explicitly test the resulting values of the second

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

Is there a reason not to test this?

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 7, 2016

Contributor

Not in particular, I've added this coverage

# factor; we just want to implicitly ensure that the addition of
# another factor to the pipeline term graph does not cause any
# unexpected exceptions when calling `run_pipeline`.
for factor2 in (None,

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

This is only ever testing with a mask that's constant w/r/t each asset. We should probably test with a filter that sometimes returns True and other times returns False.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 7, 2016

Contributor

Good point, I've added this in.

return workspace[mask][offset:], dates[offset:]
mask_offset = graph.extra_rows[mask] - graph.extra_rows[term]
dates_offset = (
graph.extra_rows[self._root_mask_term] - graph.extra_rows[term]

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

Might be worth commenting here that we're comparing against _root_mask_term because that's what determines the shape of the top-level dates array.

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

Also might be worth renaming the dates parameter to something like all_dates to be clear that we're slicing of the portion that's actually relevant to the particular term we're computing.

self._add_to_graph(
term.mask,
parents,
extra_rows=0,

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016

Member

I think this actually wants to be extra_rows here, not 0. If we need 3 extra rows of a ComputableTerm, then we also need 3 extra rows of the mask. What we don't need is an additional term.extra_input_rows of the mask.

@dmichalowicz dmichalowicz force-pushed the factor-mask branch 2 times, most recently from 9308239 to f0641aa Apr 7, 2016

@dmichalowicz

This comment has been minimized.

Contributor

dmichalowicz commented Apr 7, 2016

@ssanderson I merged your changes and updated based on your comments.

# Day 2 True True False False
# Day 3 True False False False
#
mask = AssetIDPlusDay() < asset_ids[-1] + dates[0].day

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

Might help readability here to put parens around the RHS of the < expression.

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

This isn't hitting the cases where an asset transitions from False -> True.

factor2_values = full(num_dates,
3.0 * (constants[open] - constants[close]))

def create_expected_results(values):

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

Clearer here might be to just compute the expected mask and then use np.where(mask, factor_values, nan).

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

This would also make it easier to test with more complex masks. See note below.

# adding another factor to the pipeline with a different window length
# does not cause any unexpected behavior, especially when both factors
# share the same mask.
columns['factor2'] = RollingSumDifference(mask=mask)

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

Can we also test that adding the second factor doesn't affect the result of the first factor?

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 7, 2016

Contributor

Ah yes meant to do that, good catch

"""
Load mask and mask row labels for term.
"""
mask = term.mask
mask_offset = graph.extra_rows[mask] - graph.extra_rows[term]

# This offset is computed against _root_mask_term because that is what

This comment has been minimized.

@ssanderson
@@ -8,13 +8,15 @@

from nose_parameterized import parameterized
from numpy import (
append,

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

FWIW, the already-imported np.concatenate does the same thing that you're accomplishing with append, except it takes a list of arrays instead of two positional args.

@ssanderson

This comment has been minimized.

Member

ssanderson commented Apr 7, 2016

Travis build is failing.

@dmichalowicz dmichalowicz force-pushed the factor-mask branch from f0641aa to 0d6d524 Apr 7, 2016

columns = {'factor1': OpenPrice(mask=mask), 'mask': mask}
pipeline = Pipeline(columns=columns)
results = engine.run_pipeline(pipeline, dates[0], dates[-1])
mask_results = results['mask'].unstack()

This comment has been minimized.

@ssanderson

ssanderson Apr 7, 2016

Member

Can we assert that the masks are actually as expected here?

@ssanderson

This comment has been minimized.

Member

ssanderson commented Apr 7, 2016

LGTM; merge on green from Travis.

@dmichalowicz dmichalowicz force-pushed the factor-mask branch from 1e3eaff to 4449f28 Apr 7, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 7, 2016

Coverage Status

Coverage decreased (-0.03%) to 84.236% when pulling 4449f28 on factor-mask into c5e15f2 on master.

@ssanderson ssanderson merged commit 675eff4 into master Apr 8, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the factor-mask branch Apr 8, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment