New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple outputs for custom factors #1119

Merged
merged 1 commit into from Apr 21, 2016

Conversation

Projects
None yet
5 participants
@dmichalowicz
Contributor

dmichalowicz commented Apr 8, 2016

Add an optional outputs parameter to CustomFactor allowing multiple outputs to be computed and returned from a single CustomFactor, each output of which is itself a Factor.

@coveralls

This comment has been minimized.

coveralls commented Apr 8, 2016

Coverage Status

Coverage increased (+0.04%) to 84.308% when pulling 233602c on multiple-factor-outputs into 675eff4 on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 8, 2016

Coverage Status

Coverage increased (+0.04%) to 84.309% when pulling 835af4e on multiple-factor-outputs into 675eff4 on master.

@@ -510,6 +520,42 @@ def test_rolling_and_nonrolling(self):
),
)
def test_factor_with_multiple_outputs(self):

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 11, 2016

Contributor

TODO: Add better test cases

self.factor = factor
self.attribute = attribute
self.compute = factor.compute
self._compute = factor._compute

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

I'm thinking maybe this should implement its own _compute method instead of adopting it from the parent CustomFactor.

class RecarrayFactor(Factor):
siblings = {}

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

This dict is going to be shared by all instances of RecarrayFactor. You probably want to just have this be NotSpecified and make a new dict.

sibling = siblings.pop()
workspace[sibling] = term_output[sibling.attribute]
else:
workspace[term] = term_output

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

Probably should move this into a helper function

self._inputs_for_term(term, workspace, graph),
mask_dates,
assets,
mask,
)
if hasattr(term, 'siblings'):
siblings = term.siblings[term.factor]
while siblings:

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

This will break if you compute the same term twice.

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

As a general rule, any term we use in the Pipeline API needs to be immutable.

@dmichalowicz dmichalowicz force-pushed the multiple-factor-outputs branch from ebf033a to 3fc69c1 Apr 12, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.04%) to 84.309% when pulling 3fc69c1 on multiple-factor-outputs into 3f2a574 on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.003%) to 84.274% when pulling 3fc69c1 on multiple-factor-outputs into 3f2a574 on master.

def compute(self, today, assets, out, open, close):
out.double_open[:] = open * 2
out.double_close[:] = close * 2

This comment has been minimized.

@twiecki

twiecki Apr 12, 2016

Contributor

Nice API!

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.03%) to 84.297% when pulling d21ece4 on multiple-factor-outputs into 3f2a574 on master.

enough outputs and that factor does not have class-level default outputs.
"""
msg = (
"{termname} requires at least two outputs, but was given "

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

"at least two" seems more specific than this should be?

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

Yes; I'll make RecarrayFactor support single outputs as well

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

I actually meant that this might want to be at least {N}, where N is an arg, but I don't know if there are cases where we know we need more than two outputs.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

Ah ok. So for the purpose of RecarrayFactors as we have them it makes sense, but wouldn't hurt to be more generalizable in case we come across such a case.

window_length : int, optional
Number of rows to pass for each input. If this argument is not passed
to the CustomFactor constructor, we look for a class-level attribute
named `window_length`.
Returns

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

Generally you don't put a Returns block in a class, since the assumption (which is true here?) is that the return value of the class is an instance of the class.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

Yeah; wasn't sure the best way to convey the possibility of returning a RecarrayFactor. Will remove for now.

@@ -109,7 +111,14 @@ def _compute(self, windows, dates, assets, mask):
compute = self.compute
missing_value = self.missing_value
params = self.params
out = full_like(mask, missing_value, dtype=self.dtype)
outputs = self.outputs
if outputs is not NotSpecified and len(outputs) > 1:

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

If you specify outputs at all, I'd say you should always get a recarray with all the fields of your outputs.

if outputs is NotSpecified:
outputs = cls.outputs
if outputs is NotSpecified:
outputs = tuple()

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

You can get an empty tuple with 5 fewer characters :).

else:
raise AttributeError(
"This factor has no output called '{}'.".format(name)
)

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

This is for allowing something like:

class MyFactor(CustomFactor):
    outputs = ['alpha', 'beta']
    ...
my_factor = MyFactor()
alpha, beta = my_factor.alpha, my_factor.beta

Should we keep both this and __iter__ and support either way of accessing the outputs, or just pick one single option?

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

I think having both of these makes sense.

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage decreased (-0.008%) to 84.263% when pulling a872419 on multiple-factor-outputs into 3f2a574 on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.03%) to 84.297% when pulling a872419 on multiple-factor-outputs into 3f2a574 on master.

@dmichalowicz dmichalowicz force-pushed the multiple-factor-outputs branch from 4596d90 to 9fce2f8 Apr 12, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.06%) to 84.294% when pulling 9fce2f8 on multiple-factor-outputs into d2ab5ed on master.

def _validate(self):
super(RecarrayFactor, self)._validate()
if not self.outputs:
raise TermOutputsNotSpecified(termname=type(self).__name__)

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

In thinking about it, I don't think a RecarrayFactor even needs to specify any outputs. At the end of the day it acts how all other regular factors do with a single out. So this can all be removed.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 12, 2016

Contributor

Actually, this check should be performed on the inputted factor, i.e. check that the parent's factor.outputs is not empty.

class RecarrayFactor(SingleInputMixin, Factor):
def __new__(cls, factor, attribute):

This comment has been minimized.

@ssanderson

ssanderson Apr 12, 2016

Member

This should probably forward the mask from its parent.

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.09%) to 84.323% when pulling 6bf612b on multiple-factor-outputs into d2ab5ed on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 12, 2016

Coverage Status

Coverage increased (+0.09%) to 84.323% when pulling bfdb648 on multiple-factor-outputs into d2ab5ed on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 13, 2016

Coverage Status

Coverage increased (+0.02%) to 84.252% when pulling 7df8544 on multiple-factor-outputs into d2ab5ed on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 13, 2016

Coverage Status

Coverage increased (+0.01%) to 84.245% when pulling fb0d521 on multiple-factor-outputs into d2ab5ed on master.

@dmichalowicz dmichalowicz changed the title from WIP: Support multiple outputs for custom factors to Support multiple outputs for custom factors Apr 13, 2016

# Ensure that both methods of accessing our outputs return the same
# things.
self.assertIs(open_price, multiple_outputs.open)

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

These checks probably belong in test_term.py.

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

Other things to verify in test_term:

  • Constructing two instances of the same factor with different output strings (including re-orderings of the same set of strings) should produce different factor instances. The RecarraySlices produced by those factors should be the same even if the names overlap.
  • We should fail with a useful error at construction time if the user passes an empty sequence as output.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 20, 2016

Contributor

The RecarraySlices produced by those factors should be the same even if the names overlap.

Do you mean they should be different?

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

Yes

On Apr 20, 2016, at 11:13 AM, David Michalowicz notifications@github.com wrote:

In tests/pipeline/test_engine.py:

  •    open = USEquityPricing.open
    
  •    close = USEquityPricing.close
    
  •    engine = SimplePipelineEngine(
    
  •        lambda column: self.loader, self.dates, self.asset_finder,
    
  •    )
    
  •    def create_expected_results(expected_value, mask):
    
  •        expected_values = where(mask, expected_value, nan)
    
  •        return DataFrame(expected_values, index=dates, columns=assets)
    
  •    multiple_outputs = MultipleOutputs()
    
  •    open_price, close_price = MultipleOutputs()
    
  •    # Ensure that both methods of accessing our outputs return the same
    
  •    # things.
    
  •    self.assertIs(open_price, multiple_outputs.open)
    
    The RecarraySlices produced by those factors should be the same even if the names overlap.

Do you mean they should be different?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub

class-level attribute named `inputs`.
outputs : iterable, optional

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

I sometimes write this is as outputs : iterable[str], optional.

out.alpha[:] = computed_alpha
out.beta[:] = computed_beta
# Each output is returned as its own Factor upon instantiation.

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

You might also want to show the attribute-access style here.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 20, 2016

Contributor

Added

return RecarrayFactor(factor=self, attribute=name)
else:
raise AttributeError(
"This factor has no output called '{}'.".format(name)

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

The more idiomatic way to get the quotes here is {!r}.

If the factor doesn't have custom outputs at all, then this should probably just re-raise a normal AttributeError.

If it does have custom outputs, then we should probably include the name of the failing factor instead of just saying "this factor".

)
def __iter__(self):
if len(self.outputs) < 2:

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

I think this should still work if outputs has length 1.

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

See note below re: using NotSpecified if a user doesn't supply outputs. I would make this check:

if self.outputs is NotSpecified:
    raise ValueError(...)
def __iter__(self):
if len(self.outputs) < 2:
raise ValueError('This factor does not have multiple outputs.')

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

Same not as above "this factor" is much less useful than "FactorThatIThoughtHadMultipleOutputs"

)
def _init(self, attribute, *args, **kwargs):
self.attribute = attribute

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

Generally I've been keeping internal factor state in _variables. See, for example, Rank and GroupedRowTransform.

return iter(map(RecarrayFactor_, self.outputs))
class RecarrayFactor(SingleInputMixin, Factor):

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

I would probably call this RecarrayField to make it clear that it's not producing a recarray, it's taking a slice of one.

if outputs:
out = recarray(
mask.shape,
formats=[self.dtype.str] * len(outputs),

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

We might at some point want to support different dtypes for the different fields. It's probably a fine restriction to start with that multi-output factors have to have a fixed dtype for all fields, but we should call that out somewhere in the docs.

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 20, 2016

Contributor

Added in CustomFactor doc string

formats=[self.dtype.str] * len(outputs),
names=outputs,
)
out[:] = array([missing_value])

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

Do we need to cast to an array here? It seems to work to just assign the missing value directly:

In [2]: recarray((2, 2), formats=['i', 'i'], names=['a', 'b'])
Out[2]:
rec.array([[(-396702552, 32663), (-396702552, 32663)],
       [(-392389200, 32663), (-392592440, 32663)]],
      dtype=[('a', '<i4'), ('b', '<i4')])

In [3]: r = recarray((2, 2), formats=['i', 'i'], names=['a', 'b'])

In [4]: r[:] = 0

In [5]: r
Out[5]:
rec.array([[(0, 0), (0, 0)],
       [(0, 0), (0, 0)]],
      dtype=[('a', '<i4'), ('b', '<i4')])

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 19, 2016

Contributor

Huh, could've sworn I tried that. You're right though, it works

if outputs is NotSpecified:
outputs = cls.outputs
if outputs is NotSpecified:
outputs = ()

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

I think we should just leave this as NotSpecified here, otherwise we can't detect and raise if a user does:

class MyDumbFactor(CustomFactor):
    outputs = ()
@@ -27,6 +27,13 @@ Enhancements
which manages expiration of entries based on the `dt` supplied to the `get`
method.
* Implemented :class:`zipline.pipeline.factors.RecarrayFactor`, a new pipeline
term designed to be the output type of a CustomFactor with multiple outputs.

This comment has been minimized.

@ssanderson

ssanderson Apr 19, 2016

Member

This should reference the issue number.

Also, I think the next release is going to be 1.0, not 0.9.1, so we should probably move the notes here to 1.0.0.txt

This comment has been minimized.

@richafrank

richafrank Apr 19, 2016

Member

Looks like we need to merge the two and get rid of 0.9.1...

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 20, 2016

Contributor

I can do this move; should it go in a separate PR though?

raise ValueError(
'{factor} does not have multiple outputs.'.format(
factor=self.__class__.__name__,
)

This comment has been minimized.

@dmichalowicz

dmichalowicz Apr 20, 2016

Contributor

Modified the language here and above

@dmichalowicz

This comment has been minimized.

Contributor

dmichalowicz commented Apr 20, 2016

@ssanderson Should be ready for another look

'''
dtype = float64_dtype
def __getattr__(self, attribute_name):
if self.outputs is NotSpecified:
raise AttributeError()

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

This should be AttributeError(attribute_name).

else:
raise AttributeError(
'Instance of {factor} has no output called {attr!r}.'.format(
factor=self.__class__.__name__, attr=attribute_name,

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

I tend to prefer type(self).__name__ for this. Not a big deal though.

# Reordering outputs.
self.assertIsNot(
multiple_outputs,
MultipleOutputs(

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

Another option for this would be list(reversed(MultipleOutputs.outputs))

@@ -243,6 +288,30 @@ class SomeFactor(Factor):
self.assertIsNot(orig_foobar_instance, SomeFactor())
def test_instance_non_caching_multiple_outputs(self):
multiple_outputs = MultipleOutputs()

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

Might want to verify that this produces the same result if you do MultipleOutputs(outputs=MultipleOutputs.outputs).

@@ -433,10 +433,20 @@ class TermInputsNotSpecified(ZiplineError):
msg = "{termname} requires inputs, but no inputs list was passed."
class TermOutputsNotSpecified(ZiplineError):

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

This should be TermOutputsEmpty, not NotSpecified. NotSpecified means they weren't passed at all.

alpha = multiple_outputs.alpha
beta = multiple_outputs.beta
Note: If a CustomFactor has multiple outputs, all outputs must be of the

This comment has been minimized.

@ssanderson

ssanderson Apr 20, 2016

Member

Grammar nit: "must be of the same" is a bit awkward. "must have the same" is cleaner IMO.

@dmichalowicz dmichalowicz force-pushed the multiple-factor-outputs branch from 0eb4d12 to ec68d04 Apr 20, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 20, 2016

Coverage Status

Coverage increased (+0.05%) to 84.245% when pulling ec68d04 on multiple-factor-outputs into 842c47b on master.

@dmichalowicz dmichalowicz force-pushed the multiple-factor-outputs branch from ec68d04 to c4d190c Apr 21, 2016

@coveralls

This comment has been minimized.

coveralls commented Apr 21, 2016

Coverage Status

Coverage increased (+0.05%) to 84.243% when pulling c4d190c on multiple-factor-outputs into 2179553 on master.

@dmichalowicz dmichalowicz force-pushed the multiple-factor-outputs branch from c4d190c to d9bfcaa Apr 21, 2016

@ssanderson

This comment has been minimized.

Member

ssanderson commented Apr 21, 2016

LGTM. @dmichalowicz feel free to merge when appveyor finishes.

@dmichalowicz dmichalowicz merged commit 28ffee2 into master Apr 21, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@dmichalowicz dmichalowicz deleted the multiple-factor-outputs branch Apr 21, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment