Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Generic Downsampling to Pipeline #1394

Merged
merged 25 commits into from Aug 18, 2016
Merged

Add Generic Downsampling to Pipeline #1394

merged 25 commits into from Aug 18, 2016

Conversation

@ssanderson
Copy link
Contributor

@ssanderson ssanderson commented Aug 16, 2016

Adds a new downsample method to all computable terms. Computable
terms (Filters, Factors, and Classifiers) can be downsampled to yearly,
quarterly, monthly, or weekly frequency.

The result of term.downsample is a new term of the same
family (Filter/Factor/Classifier) as term. The downsampled term
computes by delegating to the original term; repeatedly calling its
compute method with length-1 date ranges.

Downsampled terms take advantage of a new compute_extra_rows Term
method, which allows terms to dynamically request that additional extra
rows of themselves be computed based on the dates for which they're
being computed. This ensures, for example, that a monthly-downsampled
term always computes at the start of a month, even when a
naively-calculated pipeline window would end in the middle of the month.


self._frozen = False
parents = set()
for term in itervalues(terms):
self._add_to_graph(term, parents, extra_rows=0)
for name, term in iteritems(terms):

This comment has been minimized.

@twiecki

twiecki Aug 16, 2016
Contributor

Seems like name doesn't get used.

----------
frequency : str, {'Y', 'Q', 'M', 'W'}
A string indicating the desired sampling rate.
'Y' -> sample on the first trading day of each calendar year

This comment has been minimized.

@twiecki

twiecki Aug 16, 2016
Contributor

Often you also want the last trading day of a YQMW. Not saying it has to be part of this PR but preferably the syntax does not exclude that from getting implemented.

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

we could use pandas strings like M for month end and MS for month start

This comment has been minimized.

@twiecki

twiecki Aug 16, 2016
Contributor

+1 for using pandas' standard.

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

My only objection to this is that pandas inexplicably uses A and AS for yearly instead of Y and YS, which I have never once successfully remembered.

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

@twiecki @llllllllll my other concern with using the pandas conventions is that I think calculating at the start of a period will be the more common desired outcome, but the more obvious period names of 'M', 'Y' and 'Q' mean all mean the end of the period. Pandas also doesn't have a start/end distinction for weekly frequency. Given that I think relatively few of our users are going to know the pandas symbols by heart, I don't think there's a ton to be gained by copying their conventions. At the same time, there are probably enough users who are familiar with pandas that they'd be confused if we re-used 'M' and 'Q' but had them mean different things.

In light of all of the above, what do you think about just using slightly-more-verbose strings like 'month_start', 'year_start', 'quarter_end'?

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

I agree that we should not make it close but slightly different to pandas. If you really don't like the pandas version then I think the verbose names are fine. I think people will look this up when they need to use it anyways so following the pandas convention can help the few people that do know them.

This comment has been minimized.

@twiecki

twiecki Aug 17, 2016
Contributor

I only knew the BMonthStart and BMonthEnd date offsets to begin with. The verbose names sound fine to me as well.

@@ -527,7 +527,7 @@ def init_class_fixtures(cls):
cls.sim_params = cls.make_simparams()


class WithTradingSessions(WithTradingCalendars):
class WithTradingSessions(WithTradingCalendars, WithDefaultDateBounds):

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

Looks like the order of these parent classes needs to be switched

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

The order here shouldn't matter; both WithTradingCalendars and WithDefaultDateBounds are direct subclasses of object. The tests are currently failing because another class is manually inheriting from both WithTradingSessions and WithDefaultDateBounds. The right fix is to remove the duplication at that use-site.

"""
def __init__(self, terms):
super(TermGraph, self).__init__(self)
super(TermGraph, self).__init__()

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

Why this change?

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

This was just a bug. We were passing self twice to the parent constructor, which happened to work because the constructor of DiGraph takes another graph as its first argument, which is copied. In this case we were forwarding an empty graph to the parent.


"""
A single field from a multi-output factor.
"""

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

🎉

terms : dict
A dict mapping names to final output terms.
all_dates : pd.DatetimeIndex
The dates fo

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

Incomplete sentence 😉

default_screen : zipline.pipeline.term.Term
Term to use as a screen if self.screen is None.
"""
return TermGraph(self._prepare_graph_terms())

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

I think this is missing its arguments

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

Good catch. The codepath that calls this isn't tested because it relies on having the dot program installed. I'll see if I can at least throw in a smoketest for this.

initial_workspace={self._root_mask_term: root_mask_values},
initial_workspace={
self._root_mask_term: root_mask_values,
self._root_mask_dates_term: dates.values[:, None],

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

Maybe use as_column here?

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

I think this is what I actually wrote as_column for and then forgot to use it. Will update.

return a[:, None]


def changed_locations(a, include_first):

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

Do you think this is worth adding a short test for?

This comment has been minimized.

@ssanderson

ssanderson Aug 16, 2016
Author Contributor

The example here is run automatically by doctest. Do you think it needs something beyond that?

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 16, 2016
Contributor

No I think that is sufficient (I did not know that)

@dmichalowicz
Copy link
Contributor

@dmichalowicz dmichalowicz commented Aug 16, 2016

@ssanderson Done with a first review (looks great!). I only skimmed through most of the tests though, so if there were certain parts of the tests you think I should look at more in depth look just let me know.

# i is the loop iteration variable below.
return [a[[i]] for a in inputs]

def skip_this_input():

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 17, 2016
Contributor

Would this be more aptly named skip_this_date?

This comment has been minimized.

@ssanderson

ssanderson Aug 17, 2016
Author Contributor

I would think that that would encompass the whole action that happens in the loop iteration, whereas this just applies to the part that interacts with inputs.

@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.03%) to 85.754% when pulling 7a2437e on downsample into 4642fd2 on master.

1 similar comment
@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.03%) to 85.754% when pulling 7a2437e on downsample into 4642fd2 on master.

@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.07%) to 85.792% when pulling 942118f on downsample into 4642fd2 on master.

def _downsampled_type(self):
return DownsampledFactor
return DownsampledMixin.make_downsampled_type(Factor)

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 17, 2016
Contributor

Can you remove DownsampledFactor now?

Parameters
----------
term : {t}
{{frequency}}

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 17, 2016
Contributor

Does this require the templated_docstring decorator?

This comment has been minimized.

@dmichalowicz

dmichalowicz Aug 17, 2016
Contributor

Ah never mind, didn't see below

Scott Sanderson added 8 commits Aug 12, 2016
They belong on LoadableTerm instead.
- Split out extra_rows handling into an `ExecutionPlan` subclass.
  `ExecutionPlan` now requires the dates and calendar against which a
  set of terms will be computed, and now defers to a term's
  `compute_extra_rows` method when deciding how many extra rows are
  required to compute for that term. This will allow downsampled terms
  to request enough extra rows to guarantee that we can maintain consistent
  calculation dates.

  As a consequence of the above, `TermGraph` now only deals with logical
  dependencies, not with metadata surrounding extra row calculations.
  This means that TermGraph can be used to generate dependency
  visualizations in interactive contexts where we don't yet have a
  calendar or start/end dates.

- Refactored test_{filter,factor,classifier} to use check_terms instead
  of run_graph.  This makes it easier to make changes to TermGraph,
  since the testing interface is now to simply provide a dict of terms.

- Refactored BasePipelineTestCase to use fixtures to create an asset
  finder.  This fixes a potential leak of the test's asset db, which was
  not being explicitly cleaned up.

- Refactored test_technical to use BasePipelineTestCase.

- Added a new special term, `InputDates()`, which can be used to request
  date labels for inputs.  Like `AssetExists`, `InputDates` is provided
  in the initial workspace by default.

- Added a default (failing) `_compute` method to `AssetExists` which
  provides a more useful error than AttributeError.
Scott Sanderson added 11 commits Aug 16, 2016
- Add WithDefaultDateBounds, since we use alias('START_DATE') and
  alias ('END_DATE').

- Fix copypasta in assertion.
Adds a new ``downsample`` method to all computable terms.  Computable
terms (Filters, Factors, and Classifiers) can be downsampled to yearly,
quarterly, monthly, or weekly frequency.

The result of ``term.downsample`` is a new term of the same
family (Filter/Factor/Classifier) as ``term``.  The downsampled term
computes by delegating to the original term; repeatedly calling its
``compute`` method with length-1 date ranges.

Downsampled terms take advantage of a new ``compute_extra_rows`` Term
method, which allows terms to dynamically request that additional extra
rows of themselves be computed based on the dates for which they're
being computed.  This ensures, for example, that a monthly-downsampled
term always computes at the start of a month, even when a
naively-calculated pipeline window would end in the middle of the month.
Consolidate docs and mixin applications into one place.
Make sure that WithDefaultDateBounds is last in everyone's mro().
Scott Sanderson
@ssanderson ssanderson force-pushed the downsample branch from 942118f to 659ba57 Aug 17, 2016
@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.09%) to 85.791% when pulling 659ba57 on downsample into 440806a on master.

@ssanderson ssanderson force-pushed the downsample branch from 4ed650b to 0672a35 Aug 17, 2016
@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.09%) to 85.791% when pulling 4ed650b on downsample into 440806a on master.

@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.05%) to 85.753% when pulling 0672a35 on downsample into 440806a on master.

@coveralls
Copy link

@coveralls coveralls commented Aug 17, 2016

Coverage Status

Coverage increased (+0.09%) to 85.791% when pulling 0672a35 on downsample into 440806a on master.

@ssanderson ssanderson force-pushed the downsample branch from 7c3e611 to b1894e0 Aug 17, 2016
@coveralls
Copy link

@coveralls coveralls commented Aug 18, 2016

Coverage Status

Coverage increased (+0.2%) to 85.868% when pulling 7c3e611 on downsample into 440806a on master.

@coveralls
Copy link

@coveralls coveralls commented Aug 18, 2016

Coverage Status

Coverage increased (+0.2%) to 85.868% when pulling b1894e0 on downsample into aae9f84 on master.

@dmichalowicz
Copy link
Contributor

@dmichalowicz dmichalowicz commented Aug 18, 2016

Other than the to_simple_graph fix, looks good to me

@coveralls
Copy link

@coveralls coveralls commented Aug 18, 2016

Coverage Status

Coverage increased (+0.2%) to 85.911% when pulling 5a5353b on downsample into aae9f84 on master.

@ssanderson ssanderson merged commit acd5ef0 into master Aug 18, 2016
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@ssanderson ssanderson deleted the downsample branch Aug 18, 2016
bartosh pushed a commit to bartosh/zipline that referenced this pull request Sep 27, 2016
Add Generic Downsampling to Pipeline
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.