FFC API #630

Merged
ssanderson merged 5 commits into master from ffc-api-rebase on Jul 29, 2015

4 participants
@ssanderson
Member

ssanderson commented Jun 30, 2015

NOTE: Throughout this description I've added sections marked with RFC, meaning "Request For Comments", on locations where I'm particularly interested in feedback.

This patch lays the groundwork for a compute engine designed to facilitate construction of factor-based universe screening and portfolio allocation.

Goals

  1. Provide a high-level, declarative API for representing trailing window computations on large datasets.
  2. Provide groundwork for a dynamic portfolio construction API using Filters and Factors.
  3. Do 1 and 2 in a way that's sufficiently abstracted from the underlying computation process that we can change the internals without breaking public APIs.
  4. Be fast enough and memory efficient enough to support computation on universes with thousands of assets.

Content

zipline.modelling

Contains entities that can be used to express computations as dependency graphs. Each node in such a graph is an instance of the base Term class, defined in zipline.modelling.term. Dependency graphs are executed by instances of FFCEngine, defined in zipline.modelling.engine. The only interesting implementation is SimpleFFCEngine, but there's also a NoOpFFCEngine which is used when no FFC Loaders can be found.
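To make the graph idea concrete, here is a minimal sketch of evaluating a dependency graph so that each node is computed once, after its inputs. The names `Node` and `run_graph` are hypothetical and exist only for this illustration; this is not the actual `Term` or `SimpleFFCEngine` code.

```python
class Node(object):
    """Hypothetical stand-in for a Term: a name, input nodes, and a function."""

    def __init__(self, name, inputs, compute):
        self.name = name
        self.inputs = tuple(inputs)
        self.compute = compute


def run_graph(outputs):
    """Evaluate every node reachable from `outputs` exactly once."""
    results = {}

    def visit(node):
        # Reuse the stored result if this node was already computed.
        if node.name in results:
            return results[node.name]
        # Compute dependencies first, then the node itself.
        args = [visit(dep) for dep in node.inputs]
        results[node.name] = node.compute(*args)
        return results[node.name]

    for node in outputs:
        visit(node)
    return results


close = Node("close", [], lambda: [10.0, 11.0, 12.0])
mean = Node("mean", [close], lambda xs: sum(xs) / len(xs))
above = Node("above_mean", [close, mean], lambda xs, m: [x > m for x in xs])

results = run_graph([above])
```

Here `close` feeds both `mean` and `above_mean`, but it is only computed once because results are stored by node name.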

There are three important subclasses of Term:

  • Factor, representing the results of numerical-valued computations. Currently we only support float64 as a dtype for Factor, but it would be mostly straightforward to add integral types. Factors can be computed and referenced directly, and factors can be converted into Filters via comparison operators (>, >=, ==, etc). There are also methods for more fine-grained statistical computations like percentile analysis. In general, these methods work by producing new Term instances. For example, some_factor.percentile_between(10, 20) returns an instance of PercentileFilter, which will perform the desired computation when actually evaluated by an engine. There is experimental support for defining custom factors, and some examples have been added in zipline.modelling.factors.technical.
  • Filter, representing the results of boolean-valued computations. Filters are generally created with method calls on an instance of Factor, but there's an experimental CustomFilter for direct expression of boolean-valued computations.
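The "methods produce new Term instances" idea can be sketched as follows. `SketchFactor` and `SketchFilter` are hypothetical classes written for this example, and the percentile logic is an assumption about what such a filter might compute; none of it is taken from the patch itself.

```python
import numpy as np


class SketchFilter(object):
    """Hypothetical boolean-valued node; nothing runs until evaluate()."""

    def __init__(self, func):
        self._func = func

    def evaluate(self, data):
        return self._func(data)


class SketchFactor(object):
    """Hypothetical float64-valued node wrapping a function of the input."""

    def __init__(self, func):
        self._func = func

    def evaluate(self, data):
        return np.asarray(self._func(data), dtype=np.float64)

    def __gt__(self, threshold):
        # Build and return a new node; no computation happens here.
        return SketchFilter(lambda data: self.evaluate(data) > threshold)

    def percentile_between(self, lower, upper):
        def compute(data):
            values = self.evaluate(data)
            lo, hi = np.percentile(values, [lower, upper])
            return (values >= lo) & (values <= hi)
        return SketchFilter(compute)


prices = SketchFactor(lambda data: data)
expensive = prices > 15.0                  # a SketchFilter, not an array
band = prices.percentile_between(10, 20)   # also deferred

data = np.arange(1.0, 22.0)  # 1.0, 2.0, ..., 21.0
expensive_mask = expensive.evaluate(data)
band_mask = band.evaluate(data)
```

Both `expensive` and `band` are descriptions of work to do; the arrays only exist once `evaluate` is called with data.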

Most of the complexity of the Term class is dedicated to a mechanism that ensures that calling a Term constructor twice with the same arguments returns the same instance. (RFC: Does the implementation of this make sense to others? Most of the explanation for how and why we're doing this is in Term.__new__ and Term.static_identity.) This is valuable because it gives us common subexpression folding for free, but it adds some boilerplate to the construction of new Term subclasses. This should only affect implementers of term combinators (i.e. functions that take terms and construct new terms) like Rank and PercentileFilter. For users who just want to define new Filters/Factors with rolling computations, there are experimental CustomFilter and CustomFactor classes, which just require that the user implement a compute function that will be called on rolling windows of data.
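A minimal sketch of a memoizing constructor along these lines is shown below. `CachedTerm` and `Mean` are hypothetical names, and the identity tuple is an assumption made for the example; the real logic lives in Term.__new__ and Term.static_identity and is more involved.

```python
class CachedTerm(object):
    """Hypothetical sketch of a term whose constructor is memoized."""

    _term_cache = {}
    inputs = ()
    window_length = 0

    def __new__(cls, inputs=None, window_length=None):
        # Fall back to class-level defaults. Inputs are normalized to a
        # tuple so the identity is hashable even if a subclass used a list.
        if inputs is None:
            inputs = tuple(cls.inputs)
        else:
            inputs = tuple(inputs)
        if window_length is None:
            window_length = cls.window_length

        identity = (cls, inputs, window_length)
        try:
            return cls._term_cache[identity]
        except KeyError:
            instance = super(CachedTerm, cls).__new__(cls)
            instance.inputs = inputs
            instance.window_length = window_length
            cls._term_cache[identity] = instance
            return instance


class Mean(CachedTerm):
    inputs = ["close"]  # a list in class scope is still accepted
    window_length = 10


a = Mean()
b = Mean(inputs=["close"], window_length=10)
c = Mean(window_length=20)
```

Because `a` and `b` resolve to the same identity, they are the same object, so a graph that mentions the same computation twice only contains one node for it.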

zipline.data.ffc

(RFC: I don't like this directory structure right now. Should we just move it to zipline.modelling? Somewhere else?)

Contains loaders and dataset definitions for inputs to the modelling API. The abstract API for a modelling dataloader is defined by FFCLoader. (RFC: I think we should remove references to FFC everywhere. Thoughts on better names for this?)

New TradingAlgorithm API methods: add_factor and add_filter. These methods can only be called from initialize, and are used to inform the algorithm that each day it should compute the given terms.
Computed factor results are made available through a new factors property of the data object in before_trading_start and handle_data. Computed filter results control which assets are available in the factor matrix on each day. An asset will only appear if it passed all registered filters. (RFC: Is this the right combinatorial behavior for filters? Should we even allow multiple filters to be registered? Should data.factors be a method rather than a property so that the user can request values that passed a specific filter instead? That would add a fairly significant performance burden.)
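The "passed all registered filters" rule amounts to a logical AND across filter results. The sketch below shows that combination for a single day using NumPy and pandas; the asset symbols, factor names, and filter names are made up for illustration, and the layout of the real factor matrix may differ.

```python
import numpy as np
import pandas as pd

assets = ["AAPL", "MSFT", "XOM", "GE"]

# Hypothetical results for a single day: one row per asset.
factors = pd.DataFrame(
    {"momentum": [0.9, 0.2, 0.5, 0.7], "value": [1.5, 3.0, 2.0, 0.5]},
    index=assets,
)
filters = {
    "liquid": np.array([True, True, False, True]),
    "not_tiny": np.array([True, False, True, True]),
}

# An asset survives only if every registered filter is True for it.
passed_all = np.logical_and.reduce(list(filters.values()))
visible = factors[passed_all]
```

Supporting per-filter views instead would mean keeping each mask around and slicing on request, which is where the extra cost mentioned in the RFC would come from.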

@ssanderson ssanderson force-pushed the ffc-api-rebase branch 6 times, most recently from 4463031 to 11a598a Jun 30, 2015

@ssanderson ssanderson force-pushed the ffc-api-rebase branch 3 times, most recently from ae48987 to 6da3bb0 Jul 14, 2015

@ssanderson

Member

ssanderson commented Jul 15, 2015

@ehebert @richafrank @llllllllll @jfkirk @jbredeche

I'm still working on acceptance tests for this, but it's ready for review to be merged into Zipline. This is a big merge, so I'm not sure how we want to do the review for this. I've tagged all the people who I would expect might want to provide input.

@@ -0,0 +1,5 @@
Currently USEquityPricing is located in zipline.data.pricing

@ssanderson

ssanderson Jul 15, 2015

Member

Note to self: delete this.

@ssanderson

Member

ssanderson commented Jul 15, 2015

Particular notes:

The two most important modules here are zipline.modelling.term and zipline.modelling.engine. It'd be great if the people I tagged could read at least those two files so they have a basic sense of how the core of this works.

@ehebert, @jfkirk, and maybe @yankees714 and @StewartDouglas can you look at the changes in assets.py since you're actively overhauling that system? The most important function is the new lifetimes method, which gives you a matrix of whether all the assets we know about existed between dates x and y.
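For readers unfamiliar with the idea, a lifetimes matrix can be sketched as a boolean dates-by-assets table. The function name `lifetimes_sketch`, its signature, and the sample asset ids and dates are assumptions for this example; the actual method in assets.py may take different arguments.

```python
import numpy as np
import pandas as pd


def lifetimes_sketch(dates, start_dates, end_dates):
    """
    Return a boolean DataFrame with dates as rows and assets as columns.
    Entry (d, a) is True when asset a existed on date d.
    """
    d = dates.values[:, np.newaxis]             # shape (n_dates, 1)
    starts = start_dates.values[np.newaxis, :]  # shape (1, n_assets)
    ends = end_dates.values[np.newaxis, :]
    alive = (starts <= d) & (d <= ends)         # broadcast to (n_dates, n_assets)
    return pd.DataFrame(alive, index=dates, columns=start_dates.index)


dates = pd.date_range("2015-01-05", periods=4, freq="D")
start_dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-01-07"]), index=[1, 2])
end_dates = pd.Series(pd.to_datetime(["2015-01-06", "2015-12-31"]), index=[1, 2])

matrix = lifetimes_sketch(dates, start_dates, end_dates)
```

Broadcasting a column of dates against rows of start and end dates avoids a Python loop over assets, which matters for universes with thousands of assets.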

@llllllllll can you look at dataset.py and take a deeper look at the caching constructor in term.py?

their inputs are both conceptually immutable.
"""
if inputs is None:
    inputs = tuple(cls.inputs)

@llllllllll

llllllllll Jul 15, 2015

Member

no need to take the tuple of a tuple

@ssanderson

ssanderson Jul 15, 2015

Member

User-defined Custom{Filter,Factor}s can put a list in the class scope.

@llllllllll

llllllllll Jul 15, 2015

Member

ah, subclasses would overwrite the NotSpecified, I see.

.gitignore Outdated
@@ -57,4 +57,4 @@ docs/_build/*
benchmarks.db
# Vagrant temp folder
.vagrant
.vagrant

@ssanderson

ssanderson Jul 15, 2015

Member

Note to self: fix this.

dtype=dtype,
*args, **kwargs
)
cls._term_cache[identity] = new_instance

@llllllllll

llllllllll Jul 15, 2015

Member

new_instance = cls._term_cache[identity] = ...

domain = None
dtype = float64
_term_cache = {}

@llllllllll

llllllllll Jul 15, 2015

Member

do we care that the Term object will never be gc'd? should this maybe be a WeakValueDict?

@ssanderson

ssanderson Jul 15, 2015

Member

Hmm...I'm not sure. The terms themselves have very little data on them, and their expected lifetime is more or less the lifetime of a process as well. I'll think about it though.
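The trade-off being discussed can be seen in a small sketch comparing a plain dict cache with the standard library's weakref.WeakValueDictionary. `CachedNode` is a hypothetical class for this example only, and the behavior shown relies on CPython releasing objects as soon as their last strong reference goes away.

```python
import gc
import weakref


class CachedNode(object):
    """Hypothetical cached object; plain classes support weak references."""

    def __init__(self, name):
        self.name = name


strong_cache = {}
weak_cache = weakref.WeakValueDictionary()

strong_cache["a"] = CachedNode("a")  # the dict itself keeps this alive
node = CachedNode("b")
weak_cache["b"] = node               # the cache does not keep this alive

del node      # drop the last strong reference to the weakly cached node
gc.collect()  # not required under CPython refcounting, but harmless

strong_alive = "a" in strong_cache
weak_alive = "b" in weak_cache
```

With the weak cache, an entry disappears once nothing else refers to the object, at the cost of possibly rebuilding an instance that was dropped between uses.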

@ssanderson

Member

ssanderson commented Jul 15, 2015

@richafrank I don't know if I have specific code for you to look at except maybe the sections marked RFC in the long description above.

WindowLengthNotPositive,
WindowLengthNotSpecified,
)
from zipline.utils.lazyval import lazyval

@llllllllll

@twiecki

@ssanderson

ssanderson Jul 15, 2015

Member

@twiecki what would you see as the benefit of relative imports here? IMO, relative imports are useful when you expect that any time you move a module, you'd also move the module from which you're importing. e.g. If I have

spam/
    __init__.py
    foo.py
    bar.py

and I expect that foo.py and bar.py will always be next to each other, then it's useful in foo to do something like:

from .bar import ...

because then if I rename the whole directory, I don't have to change foo.py at all.

Here, on the other hand, I don't think there's any expected relative relationship between zipline.modelling.term and zipline.utils.lazyval. The most likely change is that term.py gets moved somewhere else, which would almost certainly break any relative import.

@ssanderson

Member

ssanderson commented Jul 15, 2015

@jfkirk it might also be good for you to read through the stuff in zipline.data.ffc.loaders as we think about how we want to publicize Zipline's data ingestion APIs.

type=type(self).__name__,
inputs=self.inputs,
window_length=self.window_length,
domain=self.domain,

@llllllllll

llllllllll Jul 15, 2015

Member

domain is unused

matrix : pd.DataFrame
A matrix of factors
"""
raise NotImplementedError()

@llllllllll

llllllllll Jul 15, 2015

Member

I prefer to put the name of the not implemented function here to aid debugging. This is style.
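As an illustration of this style suggestion, the sketch below passes the method name to NotImplementedError so the name shows up in the error message. `BaseEngine` and `IncompleteEngine` are hypothetical classes for the example; only the `factor_matrix` signature is taken from the diff.

```python
class BaseEngine(object):
    """Hypothetical abstract base; only the error message style matters."""

    def factor_matrix(self, terms, start_date, end_date):
        # Naming the method makes the failure self-describing.
        raise NotImplementedError("factor_matrix")


class IncompleteEngine(BaseEngine):
    """A subclass that forgot to override factor_matrix."""
    pass


try:
    IncompleteEngine().factor_matrix({}, None, None)
except NotImplementedError as exc:
    message = str(exc)
```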

if isinstance(term, Factor):
    factor_names.append(name)
# Remove extra rows.
if extra:

@ssanderson

ssanderson Jul 15, 2015

Member

Welp, this is a dumb block...

@ssanderson

ssanderson Jul 15, 2015

Member

(This is supposed to remove the [extra:] slice so we don't make an extraneous copy when extra is 0.)

def factor_matrix(self, terms, start_date, end_date):
    """
    Compute a factor matrix.

@ssanderson

ssanderson Jul 15, 2015

Member

This probably needs a long description of the algorithm being implemented here.

finder = self._finder
start_idx, end_idx = self._calendar.slice_locs(start_date, end_date)
if start_idx < extra_rows:
    # TODO: Use NoFurtherDataError from trading.py

@llllllllll

llllllllll Jul 15, 2015

Member

should this be done?

@ssanderson

Member

ssanderson commented Jul 15, 2015

More bikeshedding questions: should we continue to call the DataSet USEquityPricing as opposed to just EquityPricing? I suspect that they'll always be the same data types, and baking US into the name seems like we're asking for a headache down the road. @jfkirk?

@@ -264,25 +264,39 @@ def populate_cache(self):
Populates the asset cache with all values in the assets
collection.
"""
# Wipe caches before repopulating
self.cache = {}

@ehebert

ehebert Jul 15, 2015

Member

@ssanderson I think you need to rebase against master.
The populate_cache method has been removed.

return tuple(chain(tup, (elem,))), len(tup)
def bad_op(op, left, right):

@llllllllll

llllllllll Jul 15, 2015

Member

could this not be a subclass of TypeError?

@ssanderson

ssanderson Jul 15, 2015

Member

Meaning something like this?

class BadOp(TypeError):
    def __init__(self, op, left, right):
        super(BadOp, self).__init__("Can't compute...".format(...))

I could see that

def _all_terms(self):
    # Merge all three dicts.
    return dict(

@ehebert

ehebert Jul 16, 2015

Member

Is this equivalent to the following?

all_terms = {}
all_terms.update(self._filters)
all_terms.update(self._factors)
all_terms.update(self._classifiers)
return all_terms
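The body of `_all_terms` is cut off in the diff excerpt above, so the comparison below is only a sketch: the first form is an assumption about what a `dict(...)` based merge could look like, and the sample dictionaries are made up. It shows that chaining the items of several dicts and calling `update` repeatedly give the same result.

```python
from itertools import chain

# Hypothetical contents; the values would be Term instances in practice.
filters = {"liquid": "f1"}
factors = {"momentum": "x1", "value": "x2"}
classifiers = {"sector": "c1"}

# Form 1: build one dict from the chained items of all three.
merged_chain = dict(
    chain.from_iterable(d.items() for d in (filters, factors, classifiers))
)

# Form 2: successive updates, as suggested in the review comment.
merged_update = {}
merged_update.update(filters)
merged_update.update(factors)
merged_update.update(classifiers)
```

In both forms, if the same key appears in more than one dict, the value from the later dict wins.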

@ssanderson

@ssanderson ssanderson force-pushed the ffc-api-rebase branch 2 times, most recently from dc0d6fb to cef2dfa Jul 20, 2015

@ssanderson ssanderson force-pushed the ffc-api-rebase branch from 42e5537 to a806ac1 Jul 20, 2015

@ssanderson ssanderson force-pushed the ffc-api-rebase branch 8 times, most recently from f3762e5 to 91c54d2 Jul 23, 2015

ssanderson added some commits May 7, 2015

ENH: Compute engine architecture for FFC API.
This patch lays the groundwork for a compute engine designed to
facilitate construction of factor-based universe screening and portfolio
allocation.  It contains:

A new module, `zipline.modelling`, containing entities that can be used
to express computations as dependency graphs.  Each node in such a graph
is an instance of the base `Term` class, defined in
`zipline.modelling.term`.  Dependency graphs are executed by instances
of `FFCEngine`, defined in `zipline.modelling.engine`.

A new module, `zipline.data.ffc`, containing loaders and dataset
definitions for inputs to the modelling API.

New `TradingAlgorithm` api methods: `add_factor`, and `add_filter`.
These methods can only be called from `initialize`, and are used to
inform the algorithm that each day it should compute the given terms.
Computed factor results are made available through a new attribute of
the `data` object in `before_trading_start` and `handle_data`.  Computed
filter results control which assets are available in the factor matrix
on each day.

MAINT: Don't install non-zipline packages.
In particular, don't give anyone who installs zipline a global package
named 'tests'.  (sob)

MAINT: Add nullctx back to test_utils.
Temporary upstream compat for Quantopian code.

To be removed at the earliest possible convenience.

@ssanderson ssanderson force-pushed the ffc-api-rebase branch from 91c54d2 to f13e9fd Jul 29, 2015

@ssanderson ssanderson merged commit f13e9fd into master Jul 29, 2015

3 checks passed

Scrutinizer: 4 new issues, 330 updated code elements
continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
@twiecki

Contributor

twiecki commented Jul 29, 2015

👍

@ehebert

Member

ehebert commented Jul 29, 2015

I second that 👍

@richafrank richafrank deleted the ffc-api-rebase branch Nov 5, 2015
