New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline single column capabilities #1267

Merged
merged 1 commit into from Jun 23, 2016

Conversation

Projects
None yet
4 participants
@dmichalowicz
Contributor

dmichalowicz commented Jun 7, 2016

  • Adds a new Slice class, which is a term that returns a single column of a Factor/Filter/Classifer. Slices can be created by indexing into a Factor/Filter/Classifer with the desired asset column, e.g. Returns(...)[Equity(24)]
  • Allow custom factors to compute single-column outputs by specifying ndim = 1; i.e. out is a single column (and at every iteration of compute is a singleton). However, "1D" output factors cannot currently be added as pipeline columns.

@dmichalowicz dmichalowicz force-pushed the pipeline-single-columns branch from 54ef050 to cedab3e Jun 7, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 8, 2016

Coverage Status

Coverage increased (+0.08%) to 82.829% when pulling 5b798bd on pipeline-single-columns into 7879065 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 9, 2016

Coverage Status

Coverage increased (+0.07%) to 82.817% when pulling 811c469 on pipeline-single-columns into 7879065 on master.

@dmichalowicz dmichalowicz changed the title from WIP: Pipeline single column inputs to Pipeline single column inputs Jun 9, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 9, 2016

Coverage Status

Coverage increased (+0.07%) to 82.821% when pulling b9d4b2d on pipeline-single-columns into 7879065 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 9, 2016

Coverage Status

Coverage increased (+0.07%) to 82.819% when pulling 266604f on pipeline-single-columns into 7879065 on master.

parent_term=type(self.inputs[0]).__name__,
asset=self._asset,
)

This comment has been minimized.

@ssanderson

ssanderson Jun 10, 2016

Member

I think that rather than using a Mixin here, we might just want to have Slice be a direct subclass of ComputableTerm.

Rationale:

  • If we're using it as a mixin, then we have to have separate FactorSlice, FilterSlice and ClassifierSlice types. If it's just a computable term, then we only have one class.
  • Using slice as a mixin means that FactorSlice and friends inherit all the methods from Factor, but most of them don't work on Slice.
asset=asset,
inputs=[term],
window_length=0,
mask=term.mask,

This comment has been minimized.

@ssanderson

ssanderson Jun 10, 2016

Member

I wonder if we should do term.mask[asset] here to keep the mask dims in sync with the term dims?

@dmichalowicz dmichalowicz force-pushed the pipeline-single-columns branch from b3c316b to 510b9ea Jun 13, 2016

AssetID,
AssetIDPlusDay,
OpenPrice,
)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 13, 2016

Contributor

@ssanderson Is it bad practice to import from another test file?

This comment has been minimized.

@ssanderson

ssanderson Jun 13, 2016

Member

Yes.

On Mon, Jun 13, 2016 at 3:03 PM, David Michalowicz <notifications@github.com

wrote:

In tests/pipeline/test_slice.py
#1267 (comment):

  • Returns,
  • RollingLinearRegressionOfReturns,
  • RollingPearsonOfReturns,
  • RollingSpearmanOfReturns,
  • SimpleMovingAverage,
    +)
    +from zipline.pipeline.loaders.frame import DataFrameLoader
    +from zipline.pipeline.slice import Slice
    +from zipline.testing import check_arrays, parameter_space
    +from zipline.testing.fixtures import WithTradingEnvironment, ZiplineTestCase

+from .test_engine import (

  • AssetID,
  • AssetIDPlusDay,
  • OpenPrice,
    +)

@ssanderson https://github.com/ssanderson Is it bad practice to import
from another test file?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/quantopian/zipline/pull/1267/files/820af991752aaedcd8093e0ad803fba534c963d9#r66848199,
or mute the thread
https://github.com/notifications/unsubscribe/ABg8hYMxwNBaW0VG_dOTofs1EqAyReygks5qLamMgaJpZM4IwHT3
.

@coveralls

This comment has been minimized.

coveralls commented Jun 13, 2016

Coverage Status

Coverage increased (+0.1%) to 79.899% when pulling 2cd7290 on pipeline-single-columns into c83414e on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 14, 2016

Coverage Status

Coverage increased (+0.1%) to 79.894% when pulling c175107 on pipeline-single-columns into c83414e on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 14, 2016

Coverage Status

Coverage increased (+0.1%) to 79.9% when pulling eb9c227 on pipeline-single-columns into c83414e on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 14, 2016

Coverage Status

Coverage increased (+0.1%) to 79.9% when pulling 900110b on pipeline-single-columns into c83414e on master.

@dmichalowicz dmichalowicz force-pushed the pipeline-single-columns branch from 900110b to 3f51030 Jun 15, 2016

@dmichalowicz dmichalowicz changed the title from Pipeline single column inputs to Pipeline single column capabilities Jun 15, 2016

if assets[asset_column] != asset.sid:
raise NonExistentAssetInTimeFrame(
asset=asset, start_date=dates[0], end_date=dates[-1],
)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 15, 2016

Contributor

I added a check here to make sure we don't quietly return the wrong column of data for a non existent asset

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

👍 do we have test coverage for this?

asset=asset,
inputs=[term],
window_length=0,
mask=term.mask[asset] if term.mask else AssetExists(),

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 15, 2016

Contributor

Since we are taking the slice of our term's mask, this triggers a recursive call to __new__, so the else is the base case of when term is AssetExists and term.mask is None. Is this the correct way to handle this?

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

I think this should just be term.mask.

@dmichalowicz

This comment has been minimized.

Contributor

dmichalowicz commented Jun 15, 2016

@ssanderson I just made a few comments and touch-ups, but should be ready for another look.

@coveralls

This comment has been minimized.

coveralls commented Jun 15, 2016

Coverage Status

Coverage increased (+0.1%) to 79.897% when pulling 5c87cc7 on pipeline-single-columns into 4c2f0e8 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 15, 2016

Coverage Status

Coverage increased (+0.1%) to 79.905% when pulling 5c87cc7 on pipeline-single-columns into 4c2f0e8 on master.

@dmichalowicz dmichalowicz force-pushed the pipeline-single-columns branch from 5c87cc7 to a56f964 Jun 15, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 15, 2016

Coverage Status

Coverage increased (+0.1%) to 79.906% when pulling a56f964 on pipeline-single-columns into 4c2f0e8 on master.

@@ -17,6 +17,10 @@ Enhancements
- Added support for non-float columns to Blaze-backed Pipeline
datasets (:issue:`1201`).
- Added :class:`zipline.pipeline.slice.Slice`, a new pipeline term designed to

This comment has been minimized.

@ssanderson
@@ -1258,19 +1214,39 @@ def test_correlation_factors(self, returns_length, correlation_length):
`RollingSpearmanOfReturns`.
"""
my_asset_column = 0
start_date_index = 6
end_date_index = 10
start_date_index = 14

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Why did this change?

def test_correlation_and_regression_with_bad_asset(self):
"""
Test that `RollingPearsonOfReturns`, `RollingSpearmanOfReturns` and
`RollingLinearRegressionOfReturns` raise the proper exception when
given a nonexistent target asset.
"""
start_date_index = 6
end_date_index = 10
start_date_index = 14

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Same question. I don't see what else changed in these tests to necessitate a different date range.

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

I changed this so that I could run the pipeline over a contiguous set of 5 days (i.e. with no weekend) while also having plenty of days to look back on. It was previously starting at index 6 only because that was the first day that gave enough days to look back on (specifically, you need at least as many days as the window_length of returns plus the correlation/regression length). I made this change above as well for consistency and it also simplified things when computing the expected masks to have a contiguous 5 days.

Test that slices can be created by indexing into a term, and that they
have the correct shape when used as inputs.
"""
my_asset_column = 0

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Might be worth parameterizing this over my_asset_column and window_length.

my_asset = self.asset_finder.retrieve_asset(self.sids[0])
open_slice = OpenPrice()[my_asset]
with self.assertRaises(UnsupportedPipelineColumn):

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

I might call this UnsupportedPipelineTerm rather than column. I'm not sure how we want to refer to slices.

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

(This also matches NonSliceableTerm.)

)
# Slices of Returns *are* window safe.
returns = Returns(window_length=2)

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

For the purpose of this test, I think we care more about the fact that the term here is window_safe than that it's actually Returns. I would probably make a new MySafeFactor here with window_safe=True and test that.

@@ -666,3 +666,21 @@ class ScheduleFunctionWithoutCalendar(ZiplineError):
"To use schedule_function, the TradingAlgorithm must be running on an "
"ExchangeTradingSchedule, rather than {schedule}."
)
class UnsupportedPipelineColumn(ZiplineError):

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Another option for the name here might be UnsupportedPipelineOutput.

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

I think I like this slightly better than UnsupportedPipelineTerm

@@ -361,7 +361,6 @@ def compute_chunk(self, graph, dates, assets, initial_workspace):
assets,
mask,
)
assert(workspace[term].shape == mask.shape)

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

We probably should still make assertions here. How about this?

if term.ndim == 2:
    <original assertion>
else:
    assert workspace[term].shape == (mask.shape[0], 1)
A new Factor that will compute correlations between `target_slice`
and the columns of `self`.
"""
return RollingPearson(

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Can we make assertions about the dtypes of self and target_slice here? (Or are those assertions made in RollingPearson?)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

So something to make sure we can compute correlations, like this?

assert self.dtype in {float64_dtype, int64_dtype}
assert target.dtype in {float64_dtype, int64_dtype}

As in, we can't do it on datetimes

This comment has been minimized.

@ssanderson

ssanderson Jun 17, 2016

Member

I wouldn't actually use assert here, but we should fail at construction time on bad input.

correlation_length,
mask=NotSpecified):
"""
Construct a new Factor that computes rolling spearman correlation

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

I would probably add "spearman rank correlation".

A new Factor that will compute linear regressions of `target_slice`
against the columns of `self`.
"""
return RollingLinearRegression(

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Same question as above re: type checking.

correlation_length=int,
mask=(Filter, NotSpecifiedType),
)
def rolling_spearmanr(self,

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

I'm not sure we need the rolling_ prefix on these. cc @twiecki @jmccorriston @CaptainKanuk for naming thoughts.

This comment has been minimized.

@CaptainKanuk

CaptainKanuk Jun 17, 2016

Contributor

I'm not convinced the prefix is necessary, pipeline as a whole is rolling so I guess it's implicit that any statistical metric would also be rolling. I would prefer very helpful docstrings on the objects.

mask=(Filter, NotSpecifiedType),
)
def rolling_pearsonr(self,
target_slice,

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

target_slice feels a little verbose here. Do you think target is too vague?

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

No I suppose target would be ok, since the target_factor here is just self, so there's no ambiguity as to what target is.

@@ -0,0 +1,270 @@

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

We should probably move the base correlation and regression factors here as well. (I think that means you'll need to do a lazy import in the factor methods that emit those types, which is okay here.)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

Done

@@ -500,6 +525,64 @@ def __repr__(self):
)
class Slice(ComputableTerm):

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

This probably wants SingleInputMixin.

# Return a 2D array with one column rather than a 1D array of the
# column.
col = windows[0][:, asset_column]
return col[:, None]

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

We can do this in one step with windows[0][:, [asset_column]].

@@ -1156,6 +1158,72 @@ def create_empty_splits_mergers_frame():
)
def make_alternating_boolean_array(shape, first=True):

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

I might call this first_value for clarity. Seeing first=True makes me think it's a flag.

asset.
"""
window_length = 1
# HACK: We currently decide whether to load or compute a Term based on the

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

This isn't true anymore. We should be able to do inputs=() and remove close from the compute signature now.

out[:] = assets
class AssetIDPlusDay(CustomFactor):

This comment has been minimized.

@ssanderson

ssanderson Jun 16, 2016

Member

Same deal here as for AssetID.

from zipline.utils.numpy_utils import datetime64ns_dtype
class SliceTestCase(WithSeededRandomPipelineEngine, ZiplineTestCase):

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

I changed this to use WithSeededRandomPipelineEngine

class _RollingCorrelation(CustomFactor, SingleInputMixin):
@expect_dtypes(target_factor=ALLOWED_DTYPES, target_slice=ALLOWED_DTYPES)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

Am now doing type checking here

"""
outputs = ['alpha', 'beta', 'r_value', 'p_value', 'stderr']
@expect_dtypes(target_factor=ALLOWED_DTYPES, target_slice=ALLOWED_DTYPES)

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 17, 2016

Contributor

And here

super(SliceTestCase, cls).init_class_fixtures()
day = cls.trading_schedule.day
cls.dates = dates = date_range(

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

I would just have these use the calendar provided by WithNYSETradingDays.

@@ -621,6 +621,193 @@ def rank(self, method='ordinal', ascending=True, mask=NotSpecified):
"""
return Rank(self, method=method, ascending=ascending, mask=mask)
@expect_types(
target=Slice, correlation_length=int, mask=(Filter, NotSpecifiedType),

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

We'll need to update target here if we add support for Factor <-> Factor correlations. This is fine for now though.

This comment has been minimized.

@dmichalowicz

dmichalowicz Jun 21, 2016

Contributor

Yup, that's already being done here: #1291

@@ -37,6 +39,9 @@ def __init__(self, columns=None, screen=None):
if columns is None:
columns = {}
for term in columns.values():

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

Can we push this into a validate_column method? (We're duplicating this logic in add.)

@@ -72,6 +77,9 @@ def add(self, term, name, overwrite=False):
Whether to overwrite the existing entry if we already have a column
named `name`.
"""
if term.ndim == 1:
raise UnsupportedPipelineOutput()

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

It'd be nice for this error to tell you what the term was (the name and the value).

masked_assets = assets[col_mask]
out_row = out[idx][col_mask]
inputs = self._format_inputs(windows, col_mask)

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

I'm a little nervous about how much work we're doing in this loop now. On my machine a 1-year USEP spends about a second of actual execution time is just in _format_inputs.

col_mask = mask[idx]
masked_out = out[idx][col_mask]
# Never apply a mask to 1D outputs.
col_mask = array([True]) if self.ndim == 1 else mask[idx]

This comment has been minimized.

@ssanderson

ssanderson Jun 21, 2016

Member

I think this is only correct for the output col_mask.

If I pass a mask to an ndim=1 CustomFactor, I should still get my inputs masked.

@coveralls

This comment has been minimized.

coveralls commented Jun 21, 2016

Coverage Status

Coverage increased (+0.1%) to 79.909% when pulling ef19275 on pipeline-single-columns into 4c2f0e8 on master.

@dmichalowicz dmichalowicz force-pushed the pipeline-single-columns branch from 2c661b0 to 393f82e Jun 23, 2016

@coveralls

This comment has been minimized.

coveralls commented Jun 23, 2016

Coverage Status

Coverage increased (+0.1%) to 82.356% when pulling 393f82e on pipeline-single-columns into e510cbb on master.

@dmichalowicz dmichalowicz merged commit abd10d0 into master Jun 23, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@dmichalowicz dmichalowicz deleted the pipeline-single-columns branch Jun 23, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment