Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: make the data loading apis more consistent. #1109

Merged
merged 1 commit into from Apr 16, 2016

Conversation

@llllllllll
Copy link
Contributor

@llllllllll llllllllll commented Apr 4, 2016

Changes BcolzDailyBarWriter to not be an abc, data is passed as an
iterator of (sid, dataframe) pairs to the write method.

Changes the AssetsDBWriter to be a single class which accepts an engine
at construction time and has a write method for writing dataframes for
the various tables. We no longer support writing the various other data
types, callers should coerce their data into a dataframe themselves. See
zipline.assets.synthetic for some helpers to do this.

Adds a lot of test fixtures and updates many tests to use these. This
was done in the process of updating the tests that used one of the
writers above.

ref: #1108. #1100

@llllllllll llllllllll force-pushed the make-loader-apis-more-clear branch 2 times, most recently from 8561ca4 to 9f2bd48 Apr 5, 2016
@coveralls
Copy link

@coveralls coveralls commented Apr 5, 2016

Coverage Status

Coverage decreased (-0.09%) to 84.131% when pulling 9f2bd48 on make-loader-apis-more-clear into 34f47da on master.

cls.bcolz_minute_bar_days[0],
cls.bcolz_minute_bar_days[-1],
)
for sid in (1, cls.SPLIT_ASSET_SID)

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

Might be nicer to use cls.ASSET1 and cls.ASSET2 here, since we're using symbolic names for the others.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

it would need to be ASSET1_SID which is just 1, idk if that actually helps

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

It just seems strange to me to refer to some of the assets symbolically but refer to others by number. I don't care strongly though.

)

return BcolzMinuteBarReader(cls.tempdir.path)
cls.ASSETS = [cls.ASSET1, cls.ASSET2]

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

It seems a bit odd that this is cls.ASSETS when we have two other named assets here.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

this is how it was, I imagine some tests use only those two

@@ -382,8 +371,12 @@ def test_spot_price_is_unadjusted(self):
def test_spot_price_is_adjusted_if_needed(self):
# on cls.days[1], the first 9 minutes of ILLIQUID_SPLIT_ASSET are
# missing. let's get them.
day0_minutes = self.env.market_minutes_for_day(self.days[0])

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

Another thing we could have done here would be to set self.days in the testcase's init_{instance,class}_fixtures. That way the tests still get a nice variable name, but we don't clutter the global fixture namespace. Probably not worth changing now though.

cls.env.trading_days,
MockDailyBarReader()
def make_equity_info(cls):
return pd.DataFrame.from_dict(

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

Is there a reason this isn't just make_simple_equity_info with a symbols list?

mergers = pd.DataFrame([
@classmethod
def make_mergers_data(cls):
return pd.DataFrame([

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

Is there a reason we changed make_splits_data to DataFrame.from_records, but left this unchanged?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

probably not, I think the from_records is a bit more clear here. What do you think it should be?

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

I don't care deeply either way, but it seems like we should be consistent here. Maybe slight preference for from_records?

# correspond to 2% and 4% dividends, respectively.
dividends = pd.DataFrame([
@classmethod
def make_dividends_data(cls):

This comment has been minimized.

@ssanderson

ssanderson Apr 5, 2016
Contributor

ditto above note.

zipline.testing.MockDailyBarReader
"""
@classmethod
def _make_data(cls):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Thoughts on making this provide empty frames with the right dtypes instead of None as the value for "no data"?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

idk, I think None to signal empty input is reasonable.

@classmethod
def init_class_fixtures(cls):
super(WithAdjustmentReader, cls).init_class_fixtures()
cls.adjustment_db_conn = conn = sqlite3.connect(':memory:')

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Does anyone use this attribute? Could we just make it an arg to make_adjustment_writer rather than setting it on the class? The only case where that wouldn't work would be if we wanted to test writing into an already-populated db, but in that case we probably shouldn't use the fixture.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

that sounds good.

WithBcolzDailyBarReader
WithDataPortal
"""
BCOLZ_MINUTE_BAR_PATH = 'minute_equity_pricing.bcolz'

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Are there tests that need to configure this?

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

might be worth throwing a test in the name here somewhere so it's clearer where this came from if you stumble upon it

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

this would be a subdir of one of the tmpdirs so idk if people will be confused.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

also, I don't think any tests currently configure this but I think it is nice to have control over the shared tmpdir for a suite

cls.bcolz_minute_bar_path = p = cls.tmpdir.makedir(
cls.BCOLZ_MINUTE_BAR_PATH,
)
cls.bcolz_minute_bar_days = days = (

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

I think you just wanted an if-statement here...

cls.env.open_and_closes.market_close.loc[days],
US_EQUITIES_MINUTES_PER_DAY
)
cls.bcolz_minute_bar_data = df_dict = cls.make_minute_bar_data()

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

The multiple assignment here feels like more complexity than it's worth.

cls.bcolz_daily_bar_path = p = cls.tmpdir.makedir(
cls.BCOLZ_DAILY_BAR_PATH,
)
cls.bcolz_daily_bar_days = days = (

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

I think this just wanted to be an if-statement.

"""
BCOLZ_DAILY_BAR_PATH = 'daily_equity_pricing.bcolz'
BCOLZ_DAILY_BAR_LOOKBACK_DAYS = 0
BCOLZ_DAILY_BAR_FROM_CSVS = False

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

This option isn't documented. When is it used?

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Actually, it looks like all these config options need docs.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

there is a test that loads test data from a directory mapping sid->path-to-csv

complete_fill = params.get('complete_fill')

sid = 1
metadata = pd.DataFrame.from_dict(

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Could this just be a make_simple_equity_info call?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

sure

data_frequency="minute",
env=cls.env
@classmethod
def make_equity_info(cls):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

It seems like most of the invocations of make_equity_info could be converted into make_simple_equity_info calls.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

these equities have custom symbols. I do agree that most calls will probably want to be a make_*_equity_info call though.

@@ -465,31 +327,25 @@ def make_trade_data_for_asset_info(dates,
volumes = (volume_sid_deltas + volume_date_deltas[:, None]) + volume_start

for j, sid in enumerate(sids):
start_date = asset_info[sid]['start_date']
end_date = asset_info[sid]['end_date']
start_date = asset_info.loc[sid, 'start_date']

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

👍

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

I also sometimes write this specific pattern as start_date, end_date = df.loc[sid, ['start_date', 'end_date']].

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

ooh, I like that a lot. thank you

date_field: dates_arr
})

if frequency == 'minute':

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Looks like we weren't setting the index before if the frequency was 'daily'. Was that just a bug?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

no, I changed this to use the index instead of the name day to make them the same

)


def make_simple_asset_info(assets, start_date, end_date, symbols=None):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

This function was never actually used.

return df


def write_minute_data_for_asset(env, writer, start_dt, end_dt, sid,

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

This seems like it's just shuffling arguments around. Is this just for backward compat with something else?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

yeah, this existed and then I split it into a function that creates the data without writing.


def write_minute_data_for_asset(env, writer, start_dt, end_dt, sid,
interval=1, start_val=1):
writer.writer(

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Also, should this be writer.write? (Is this actually called anywhere?)

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

this is likely dead code because there is no writer method here. I will check and probably remove.



def create_daily_df_for_asset(env, start_day, end_day, interval=1):
days = env.days_in_range(start_day, end_day)
days_count = len(days)
days_arr = np.array(range(2, days_count + 2))
days_arr = np.arange(2, days_count + 2)

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Even simpler here might be np.arange(days_count) + 2

"low": lows,
"close": closes,
"volume": volumes,
"day": [day.value for day in index]

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Not your doing, but this is just index.values, right?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

I don't think we even need the day field here. buy yeah

[],
dtype=[
('effective_date', 'int64'),
('ratio', 'float32'),

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Should this be float64?

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

idk, I will check with the schema



@contextmanager
def patch_os_environment(remove=None, **values):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

I don't think this is used anywhere.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

This was added to test the new feature that this change started on. We will use this soon.

os.environ[old_key] = old_value


class tmp_dir(object):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

This might not actually be an improvement over what it's wrapping.

--------
tmp_bcolz_daily_bar_reader
"""
_reader = BcolzMinuteBarReader

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

I might call this _reader_cls to be clear that it's the type, not the reader itself.

This comment has been minimized.

@llllllllll

llllllllll Apr 6, 2016
Author Contributor

kk

@@ -126,7 +152,7 @@ def enter_instance_context(self, context_manager):
return self._instance_teardown_stack.enter_context(context_manager)

@final
def add_instance_callback(self, callback):
def add_instance_callback(self, callback, *args, **kwargs):

This comment has been minimized.

@ssanderson

ssanderson Apr 6, 2016
Contributor

Commented this on the original PR: we're not forwarding *args/**kwargs here.

@@ -224,18 +272,36 @@ class WithAssetFinder(object):
zipline.testing.make_future_info
zipline.testing.make_commodity_future_info
"""
ASSET_FINDER_EQUITY_SIDS = tuple(map(ord, 'ABC'))

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

This is substantially less clear than just ord('A'), ord('B'), ord('C'). In particular, it's not obvious at a glance that there are 3 sids here.

def ASSET_FINDER_EQUITY_END_DATE(cls):
# add one day to end to make sure the asset exists through the end
# of the data
return cls.END_DATE + pd.Timedelta(days=1)

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Is there a reason we add a 1 day buffer at the end but not the start?

This comment has been minimized.

@llllllllll

llllllllll Apr 8, 2016
Author Contributor

I don't think assets are available on their end date, I think it is [start, end). I could be wrong in which case I will remove the extra day.

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

end_date is the last day on which we have trades for an asset.

BCOLZ_DAILY_BAR_USE_FULL_CALENDAR = False
BCOLZ_DAILY_BAR_START_DATE = alias('START_DATE')
BCOLZ_DAILY_BAR_END_DATE = alias('END_DATE')
_write_method_name = 'write'

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Might be worth commenting that this exists to be overridden in a subclass, and that tests that want to change the write method should inherit from the other mixin rather than overridding this attribute (which is why the name is lowercase).

This comment has been minimized.

@llllllllll

llllllllll Apr 8, 2016
Author Contributor

I think the bigger indication that this is not the supported way to modify this is that it is private. I can add a quick comment explaining why this exists though.

@@ -399,82 +390,3 @@ def _update_buffer(self, frame):
self.buffer.loc[non_nan_items, :, non_nan_cols])

self.buffer = new_buffer


class SortedDict(MutableMapping):

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Are we no longer using this?

This comment has been minimized.

@llllllllll

llllllllll Apr 8, 2016
Author Contributor

dead

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

RIP in pieces

@@ -130,6 +133,19 @@ def ensure_timezone(func, argname, arg):
)


def expect_non_empty(func, argname, arg):

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

So far all of the other functions here named expect_<thing> take **kwargs that get mapped to the arguments. That seems like a worthwhile pattern to maintain.

Maybe this should be expect_length_between and take **lower_and_upper_bounds as pairs of ints?

This comment has been minimized.

@llllllllll

llllllllll Apr 8, 2016
Author Contributor

I thought the more important pattern was between expect and ensure. I could do that but I still would want a named version for non_empty which is a more common case imho

)
args_defaults.append((varkw, NO_DEFAULT))

argset = set(args) | {varargs, varkw} - {None}

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Can we put parens on the RHS here? I don't have immediate intuition for whether | or - binds tighter.

This comment has been minimized.

@llllllllll

llllllllll Apr 8, 2016
Author Contributor

this is the same regardless of how it binds.

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

fair point


def argname(arg):
return star_map.get(arg, '') + arg

for arg, default in args_defaults:

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Should we just keymap argname over args_defaults?

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

Ah, nvm, there's an if-statement where we need the bound name below.

varkw: '**',
}

def argname(arg):

This comment has been minimized.

@ssanderson

ssanderson Apr 8, 2016
Contributor

I might call this name_as_arg to help clarify that this is the name as it will appear in the signature, as distinguished from the name as it will be used in the body of the function.

@llllllllll llllllllll force-pushed the make-loader-apis-more-clear branch 12 times, most recently from 1dad8a6 to c96262d Apr 13, 2016
@coveralls
Copy link

@coveralls coveralls commented Apr 15, 2016

Coverage Status

Coverage decreased (-0.3%) to 84.022% when pulling c96262d on make-loader-apis-more-clear into 85a2f6f on master.

@llllllllll llllllllll force-pushed the make-loader-apis-more-clear branch from c96262d to 38ce51f Apr 15, 2016
@coveralls
Copy link

@coveralls coveralls commented Apr 15, 2016

Coverage Status

Coverage decreased (-0.3%) to 84.026% when pulling 38ce51f on make-loader-apis-more-clear into 85a2f6f on master.

@llllllllll llllllllll force-pushed the make-loader-apis-more-clear branch from 38ce51f to d906124 Apr 15, 2016
@coveralls
Copy link

@coveralls coveralls commented Apr 15, 2016

Coverage Status

Coverage decreased (-0.2%) to 84.03% when pulling d906124 on make-loader-apis-more-clear into 85a2f6f on master.

@llllllllll llllllllll force-pushed the make-loader-apis-more-clear branch from d906124 to bc0b117 Apr 16, 2016
Changes BcolzDailyBarWriter to not be an abc, data is passed as an
iterator of (sid, dataframe) pairs to the write method.

Changes the AssetsDBWriter to be a single class which accepts an engine
at construction time and has a `write` method for writing dataframes for
the various tables. We no longer support writing the various other data
types, callers should coerce their data into a dataframe themselves. See
zipline.assets.synthetic for some helpers to do this.

Adds many new fixtures and updates some existing fixtures to use the new
ones:

WithDefaultDateBounds
  A fixture that provides the suite a START_DATE and END_DATE. This is
  meant to make it easy for other fixtures to synchronize their date
  ranges without depending on eachother in strange ways. For example,
  WithBcolzMinuteBarReader and WithBcolzDailyBarReader by default should
  both have data for the same dates, so they may use depend on
  WithDefaultDates without forcing a dependency between them.

WithTmpDir, WithInstanceTmpDir
  Provides the suite or individual test case a temporary directory.

WithBcolzDailyBarReader
  Provides the suite a BcolzDailyBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from
  dataframes and then converted to bcolz files with
  BcolzDailyBarWriter.write

WithBcolzDailyBarReaderFromCSVs
  Provides the suite a BcolzDailyBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from a
  collection of CSV files and then converted into the bcolz data through
  BcolzDailyBarWriter.write_csvs

WithBcolzMinuteBarReader
  Provides the suite a BcolzMinuteBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from
  dataframes and then converted to bcolz files with
  BcolzMinuteBarWriter.write

WithAdjustmentReader
  Provides the suite a SQLiteAdjustmentReader which reads from an in
  memory sqlite database. The data will be read from dataframes and then
  converted into sqlite with SQLiteAdjustmentWriter.write

WithDataPortal
  Provides each test case a DataPortal object with data from temporary
  resources.
@coveralls
Copy link

@coveralls coveralls commented Apr 16, 2016

Coverage Status

Coverage decreased (-0.2%) to 84.049% when pulling bc0b117 on make-loader-apis-more-clear into 8c64cc8 on master.

@llllllllll llllllllll merged commit bc0b117 into master Apr 16, 2016
1 of 2 checks passed
1 of 2 checks passed
continuous-integration/travis-ci/push The Travis CI build is in progress
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@llllllllll llllllllll deleted the make-loader-apis-more-clear branch Apr 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.