New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipeline missing values #962

Merged
merged 20 commits into from Feb 16, 2016

Conversation

Projects
None yet
2 participants
@ssanderson
Member

ssanderson commented Jan 21, 2016

Adds support for specifying missing_values to pipeline Terms, and makes BoundColumn.latest return a Filter instead of a Factor for expressions of dtype bool.

This also removes all implicit coercion behavior from AdjustedArray. PipelineLoaders are now responsible for ensuring that loaded values match the dtypes of requested columns.

@llllllllll tagging you on this since most of the changes here are actually in the BlazeLoader and its tests.

self.assertIsInstance(TestingDataSet.datetime_col.latest, Factor)
self.assertIsInstance(TestingDataSet.int_col.latest, Factor)

def test_failure_timing_on_bad_missing_values(self):

This comment has been minimized.

@ssanderson

ssanderson Jan 21, 2016

Member

@llllllllll I think this addresses the concern you raised in person today.

I originally had made Column fail eagerly, but that was annoying because the error couldn't tell you the name of the column that failed. The compromise here is to fail in _BoundColumnDescr, which means we still barf in class construction, but we can give you an error that includes the name of the bad column.

# property of correctly handling `NaN`.
self.assertIs(column.missing_value, column.latest.missing_value)

def test_failure_timing_on_bad_missing_values(self):

This comment has been minimized.

@ssanderson

ssanderson Jan 21, 2016

Member

@llllllllll I think this addresses the concern you raised in person today.

I originally had made Column fail eagerly, but that was annoying because the error couldn't tell you the name of the column that failed. The compromise here is to fail in _BoundColumnDescr, which means we still barf in class construction, but we can give you an error that includes the name of the bad column.

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

Class construction time seems fine, it is worth it to know which column is incorrect and I imagine that very few (if any) people will be creating columns outside of a dataset.

from zipline.utils.test_utils import check_arrays, parameter_space


def moving_window(array, nrows):

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

should we move this to numpy_utils?

This comment has been minimized.

@ssanderson

ssanderson Jan 21, 2016

Member

I don't think we need to. Anything that needs to do a moving window efficiently probably wants to use AdjustedArray. This was just a simple utility for testing.

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from 0f6c989 to 5e0a883 Jan 21, 2016

@@ -296,7 +324,7 @@ def make_simple_equity_info(sids, start_date, end_date, symbols=None):
sids : array-like of int
start_date : pd.Timestamp
end_date : pd.Timestamp
symbols : list, optional
symbols : list, optionaln

This comment has been minimized.

@llllllllll
c_adjs,
c.missing_value,
)
return out

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

I think this is probably nicer as a comprehension, but up to you.

This comment has been minimized.

@ssanderson

ssanderson Jan 22, 2016

Member

I wrote this as a comprehension and then rewrote it this way because I thought it was easier to follow. Going to leave it as is.

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from 5e0a883 to 10a1945 Jan 21, 2016

return Latest(
inputs=(self,),
dtype=self.dtype,
missing_value=self.missing_value,

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

what does a missing value mean in a filter?
edit: nevermid, charlie

def test_nothing(self):
# Ensure that there's at least one "real" test in the class, or else
# our {setUp,tearDown}Class won't be called if, for example,
# `parameter_space` returns None.

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

this is a great test class

@classmethod
def tearDownClass(cls):
# This is the only actual test here.
assert cls.xy_invocations == list(product(cls.x_args, cls.y_args))

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

s/py.test/nose/

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

also, is the order part of the contract of this function? if not should we make this a set?

This comment has been minimized.

@ssanderson

ssanderson Jan 22, 2016

Member

The way it's written right now you're guaranteed to get the result of calling itertools.product with the values in the order that they appear in the function's parameter list. I don't see a reason to change that.

This comment has been minimized.

@llllllllll

llllllllll Jan 22, 2016

Member

Sure, but that is an implementation detail, no?

This comment has been minimized.

@ssanderson

ssanderson Jan 22, 2016

Member

Eh, I think it's fine to make it part of the contract. It's nice to have defined semantics for the ordering. I just pushed a test asserting that we flip the order if we flip the arglist.

This comment has been minimized.

@llllllllll
# the user fails to provide a missing value for a dtype that requires
# one (e.g. int64), but still enables us to provide an error message
# that points to the name of the failing column.
if missing_value is NotSpecified:

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

an we also check that the missing value is coercable to dtype here? For example:

In [17]: class Ds(DataSet):
    c = Column('int64', np.nan)
   ....:     

In [18]: Ds.c
Out[18]: Ds.c::int64
Usage
-----
>>> class SomeTestCase(TestCase):

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member
ERROR: test_test_utils_docs (tests.test_doctests.DoctestTestCase)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/quantopian/zipline/tests/test_doctests.py", line 68, in test_test_utils_docs

    self._check_docs(test_utils)

  File "/home/travis/build/quantopian/zipline/tests/test_doctests.py", line 49, in _check_docs

    raise e.exc_info[1]

NameError: name 'TestCase' is not defined
yield SimplePipelineEngine(get_loader, calendar, finder)


def parameter_space(**params):

This comment has been minimized.

@llllllllll

llllllllll Jan 21, 2016

Member

can you provide a see also to subtest and have subtest have a see also to parameter_space? I would also put a recomendation to use this function in the docstring of subtest

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from 00dcf67 to 909b9d8 Jan 22, 2016

@ssanderson ssanderson force-pushed the master branch from 110798b to 0dac2e0 Jan 22, 2016

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from 909b9d8 to fe4015c Jan 22, 2016

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from fe4015c to 403c5a5 Feb 8, 2016

@mtydykov mtydykov referenced this pull request Feb 10, 2016

Merged

Buyback auth in pipeline #990

@ssanderson ssanderson force-pushed the pipeline-missing-values branch from 403c5a5 to f27b415 Feb 13, 2016

ssanderson added a commit that referenced this pull request Feb 16, 2016

@ssanderson ssanderson merged commit b632c41 into master Feb 16, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the pipeline-missing-values branch Feb 16, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment