Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add estimates quarter loader to pipeline #1396

Merged
merged 16 commits into from Sep 27, 2016

Conversation

@mtydykov
Copy link
Contributor

@mtydykov mtydykov commented Aug 16, 2016

This PR adds an estimates quarter loader to be used with Pipeline.

Some open questions are:

  • What other tests do we need to add?
  • How do we ensure that the num_quarters attribute has been added to the dynamically generated dataset by the time load_adjusted_array is called?
@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from 75fd39e to 41e94f3 Aug 16, 2016
And other dataset-specific fields, where each row of the table is a
record including the sid to identify the company, the timestamp where we
learned about the announcement, and the date when the earnings will be z

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

extra z?

learned about the announcement, and the date when the earnings will be z
announced.
If the '{TS_FIELD_NAME}' field is not included it is assumed that we

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

I think that these format strings probably came from another docstring where we had __doc__ = """....""".format(SID_FIELD_NAME=SID_FIELD_NAME, TS_FIELD_NAME=TS_FIELD_NAME) which we should preserve.

class BlazeNextEstimatesLoader(BlazeEstimatesLoader):
loader = NextQuartersEstimatesLoader

def __init__(self,

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

Is this __init__ doing anything other than forwarding everything to the super? If so we can just remove it.

class BlazePreviousEstimatesLoader(BlazeEstimatesLoader):
loader = PreviousQuartersEstimatesLoader

def __init__(self,

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

same here

from zipline.pipeline.loaders.quarter_estimates import (
NextQuartersEstimatesLoader,
PreviousQuartersEstimatesLoader,
required_estimates_fields)

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

Can you close the paren on a new line?

@@ -1220,6 +1220,36 @@ def bind_expression_to_resources(expr, resources):
})


def load_raw_data(assets, dates, data_query_time, data_query_tz, expr,

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

This feels more like a utility than something that should be in core since it is not used by the core loader.

Also, could you add a docstring to document the use of this function?

This comment has been minimized.

@mtydykov

mtydykov Aug 16, 2016
Author Contributor

Ok, I can create a new blaze/utils.py and put it there.

from zipline.pipeline.loaders.frame import DataFrameLoader

ALL_DATES = 'dates'
FISCAL_QUARTER = 'fiscal_quarter'

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

these two are in common.py

Parameters
----------
yrs : np.Series

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

can we just call these years and quarters?


def cross_product(df1, df2):
df1['key'] = 1
df2['key'] = 1

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

key seems a little too likely to collide with existing keys. Could we use a uuid4 or something? Also, could we move this to pandas_utils?


self.base_column_name_map = base_column_name_map

def load_quarters(self, num_quarters, dates_sids, final_releases_per_qtr):

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

should this be an abstract class and abstract method?

This comment has been minimized.

@mtydykov

mtydykov Aug 16, 2016
Author Contributor

Yes, I'll add the appropriate decorator.

import pandas as pd
from six import viewvalues
from toolz import groupby
from zipline.pipeline.common import (

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

there should be one line of whitespace between the third party imports and internal imports.

num_qtrs_shift : int
The number of quarters to shift forward.

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

extra whitespace

"""

if num_qtrs_shift < 0:
raise AssertionError("Must pass a number of quarters >= 0")

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

this should be a ValueError since it is an exception based on user input to a function

SIMULTATION_DATES = 'dates'


def calc_forward_shift(yrs, qtrs, num_qtrs_shift):

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

calc_forward_shift and calc_backward_shift can both use the same implementation by normalizing to number of quarters and then splitting back out the year and date parts.

def normalize_quarters(years, quarters):
    return years * 4 + quarters - 1


def split_normalized_quarters(normalized_quarters):
    years, quarters = divmod(normalized_quarters, 4)
    return years, quarters + 1


def shift_quarters(by, years, quarters):
    return split_normalized_quarters(normalize_quarters(years, quarters) + by)
In [1]: ys = np.array([2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013])

In [2]: qs = np.array([1, 2, 3, 4, 1, 2, 3, 4])

In [3]: shift_quarters(2, ys, qs)
Out[3]: 
(array([2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014]),
 array([3, 4, 1, 2, 3, 4, 1, 2]))

In [4]: shift_quarters(-2, ys, qs)
Out[4]: 
(array([2011, 2011, 2012, 2012, 2012, 2012, 2013, 2013]),
 array([3, 4, 1, 2, 3, 4, 1, 2]))

This simplifies a lot of the edge cases here.

This comment has been minimized.

@mtydykov

mtydykov Aug 16, 2016
Author Contributor

divmod doesn't work on Series.

This comment has been minimized.

@mtydykov

mtydykov Aug 16, 2016
Author Contributor

Modified it a bit and it works great. Thanks!

result.pivot(index=SIMULTATION_DATES,
columns=SID_FIELD_NAME,
values=column_name),
adjustments=None

This comment has been minimized.

@llllllllll

llllllllll Aug 16, 2016
Contributor

No deltas?

This comment has been minimized.

@mtydykov

mtydykov Aug 16, 2016
Author Contributor

Not yet. We don't apply them in the events loader either, so I'd imagine that we'd want to implement both in the future.

int64_t last_row,
int64_t first_col,
int64_t last_col):
super(ArrayAdjustment, self).__init__(

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

this is just a nop __init__

first_col=first_col,
last_col=last_col,
)
assert (last_row + 1 - first_row) == len(values)

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

don't use an assert to check input to a function, this should raise a value error if not correct.

normalize_data_query_bounds,
normalize_timestamp_to_query_time,
)
ffill_across_cols)

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

close the paren on a new line

# For the number of things that we're grouping by (except TS), unstack
# the df
for _ in range(len(idx) - 1):
last_in_group = last_in_group.unstack()

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

last_in_group.unstack([-1, -2])

df[TS_FIELD_NAME].values.astype('datetime64[D]')
)]
if have_sids:
idx = [idx, SID_FIELD_NAME] + extra_groupers

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

we should add the extra groupers regardless.

self.estimates[FISCAL_YEAR_FIELD_NAME],
self.estimates[FISCAL_QUARTER_FIELD_NAME],
).astype(float)
for num_quarters, columns in groups_columns.iteritems():

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

.items

self.estimates[NORMALIZED_QUARTERS] = normalize_quarters(
self.estimates[FISCAL_YEAR_FIELD_NAME],
self.estimates[FISCAL_QUARTER_FIELD_NAME],
).astype(float)

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

why is this a float?

This comment has been minimized.

@mtydykov

mtydykov Aug 23, 2016
Author Contributor

Left over from a previous iteration. I'll remove.

# Forward fill values for each quarter.
ffill_across_cols(last_per_qtr, columns)
# Stack quarter and sid into the index
stacked_last_per_qtr = last_per_qtr.stack(

This comment has been minimized.

@llllllllll

llllllllll Aug 23, 2016
Contributor

why are we restacking the data? also this can be .stack([NORMALIZED_QUARTERS, SID_FIELD_NAME])

This comment has been minimized.

@mtydykov

mtydykov Aug 23, 2016
Author Contributor

I couldn't figure out any other way to perform the filtering/grouping operations in load_quarters.

@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch 2 times, most recently from 8f11ae6 to c5c9545 Aug 25, 2016
@coveralls
Copy link

@coveralls coveralls commented Sep 1, 2016

Coverage Status

Coverage increased (+0.09%) to 86.222% when pulling 48af429 on add_estimates_quarter_loader_to_pipeline into 92c2a4a on master.

1 similar comment
@coveralls
Copy link

@coveralls coveralls commented Sep 1, 2016

Coverage Status

Coverage increased (+0.09%) to 86.222% when pulling 48af429 on add_estimates_quarter_loader_to_pipeline into 92c2a4a on master.

@coveralls
Copy link

@coveralls coveralls commented Sep 1, 2016

Coverage Status

Coverage increased (+0.1%) to 86.266% when pulling 1335b08 on add_estimates_quarter_loader_to_pipeline into 92c2a4a on master.

@coveralls
Copy link

@coveralls coveralls commented Sep 1, 2016

Coverage Status

Coverage increased (+0.09%) to 86.221% when pulling 1335b08 on add_estimates_quarter_loader_to_pipeline into 92c2a4a on master.

@coveralls
Copy link

@coveralls coveralls commented Sep 20, 2016

Coverage Status

Coverage increased (+0.1%) to 86.815% when pulling d26c9a9 on add_estimates_quarter_loader_to_pipeline into e1b27c4 on master.

Pipeline({'est': SomeFactor()}),
start_date=start_idx,
end_date=pd.Timestamp('2015-01-20', tz='utc'), # last event date
# we have

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

I thought this was a fragment of a comment instead of part of the last comment because of the indentation. we could just have this all on one line before end_date=....

columns=[SID_FIELD_NAME,
'estimate',
'knowledge_date'])
df = df.pivot_table(columns='sid',

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

should we use SID_FILED_NAME here also?

backwards/forwards from a starting point.
"""
def test_quarter_normalization(self):
input_yrs = pd.Series([0] * 4, dtype=np.int64)

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

could we make this not a constant and also move the years into the 2000s?

This comment has been minimized.

@mtydykov

mtydykov Sep 20, 2016
Author Contributor

By 'not a constant', do you just mean not all the same year?

If the '{TS_FIELD_NAME}' field is not included it is assumed that we
start the backtest with knowledge of all announcements.
"""

__doc__ == __doc__.format(SID_FIELD_NAME=SID_FIELD_NAME,

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

This is a bool expr instead of an assignment.

This comment has been minimized.

@mtydykov

mtydykov Sep 20, 2016
Author Contributor

Gah, wow. My mistake. Good catch!

# Find the quarter being requested in the quarter we're
# crossing into.
requested_quarter = requested_qtr_data[
SHIFTED_NORMALIZED_QTRS

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

requested_qtr_data[SHIFTED_NORMALIZED_QTRS, sid]

# requested quarter, but simply the date we cross over into a
# new quarter.
next_qtr_start_idx = dates.searchsorted(
zero_qtr_data.loc[

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

zero_qtr_data.loc[row_indexer, EVENT_DATE_FIELD_NAME]

qtrs_with_estimates_for_sid = last_per_qtr.xs(
sid, axis=1, level=SID_FIELD_NAME
).groupby(axis=1, level=1).first().columns.values
for row_indexer in list(qtr_shifts.index):

This comment has been minimized.

@llllllllll

llllllllll Sep 20, 2016
Contributor

why are we calling list on this?

@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from da542ec to a3b7a65 Sep 22, 2016
@coveralls
Copy link

@coveralls coveralls commented Sep 22, 2016

Coverage Status

Coverage increased (+0.2%) to 86.699% when pulling 93bb4d0 on add_estimates_quarter_loader_to_pipeline into ca5f98b on master.

@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from 93bb4d0 to ea88195 Sep 27, 2016
Maya Tydykov and others added 14 commits Aug 9, 2016
MAINT: modify algorithm for calculating previous releases

BUG: fix quarter calculation logic
BUG: fix bugs in blaze loader

BUG: call correct method

MAINT: explicitly cast dates column

MAINT: modify code to comply with pandas 0.16.1
BUG: fix syntax error

MAINT: optimize code for cython
BUG: fix adjustment start index
TST: fix quarter normalization test

TST: change test name

BUG: remove arg

BUG: look at dict keys

TST: add test for windowing

MAINT: raise ValueError instead of asserting

TST: add assertion to check windowing

TST: parametrize test over number of quarters forward/back.

BUG: fix adjustment calculation logic for quarter crossovers.

TST: add test for previous quarter windows

BUG: fix bugs in calculating previous windows

BUG: fix missing value for datetime

TST: add test case for missing quarter
MAINT: pass column to name dict

MAINT: make check for invalid num columns py3-compatible
BUG: add cols for sids with no data and get adjustments outside column loop
MAINT: optimization - only look at assets appearing in data

TST: simplify test

DOC: add documentation for checkpoints

MAINT: explicitly cast event date field to datetime

MAINT: add back import

TST: fix indexing to remove setting wtih copy warning
BUG: choose last event date for quarter shift
@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from ea88195 to f18561b Sep 27, 2016
@coveralls
Copy link

@coveralls coveralls commented Sep 27, 2016

Coverage Status

Coverage increased (+0.2%) to 86.695% when pulling f18561b on add_estimates_quarter_loader_to_pipeline into 1942029 on master.

@coveralls
Copy link

@coveralls coveralls commented Sep 27, 2016

Coverage Status

Coverage increased (+0.1%) to 86.651% when pulling f18561b on add_estimates_quarter_loader_to_pipeline into 1942029 on master.

@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from f18561b to b380149 Sep 27, 2016
BUG: get column names from column dict

BUG: fix name map
@mtydykov mtydykov force-pushed the add_estimates_quarter_loader_to_pipeline branch from b380149 to f528c01 Sep 27, 2016
@coveralls
Copy link

@coveralls coveralls commented Sep 27, 2016

Coverage Status

Coverage increased (+0.2%) to 86.695% when pulling f528c01 on add_estimates_quarter_loader_to_pipeline into 4a22afd on master.

@mtydykov mtydykov merged commit 12357a8 into master Sep 27, 2016
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@richafrank richafrank deleted the add_estimates_quarter_loader_to_pipeline branch Nov 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.