Memory savings #1599

Merged

Merged 5 commits into master from memory-savings on Nov 28, 2016

Conversation

3 participants
@ssanderson
Member

ssanderson commented Nov 22, 2016

No description provided.

@ssanderson ssanderson force-pushed the memory-savings branch 2 times, most recently from 14ea967 to e02fa8b Nov 22, 2016

@ssanderson

Member

ssanderson commented Nov 22, 2016

These are all relatively minor in the grand scheme of things. The overwhelming consumers of memory in zipline simulations that touch many assets are the active blocks of minutely bcolz files.

@coveralls

coveralls commented Nov 22, 2016

Coverage Status

Coverage decreased (-0.03%) to 87.093% when pulling e02fa8b on memory-savings into f3a36fe on master.


ssanderson added some commits Nov 15, 2016

PERF: Deterministically GC pipeline results.
Any DataFrame that's had `.loc` or `.iloc` called on it participates in
a reference cycle, which means it isn't immediately garbage collected
when it goes out of scope.  This matters for pipeline results because
they consume multiple megabytes per column, so a pipeline result with
many columns can take up over 100MB.  By manually breaking DataFrame
cycles, we can ensure that we never hold multiple pipeline results in
memory at once.
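The mechanism isn't pandas-specific; a minimal sketch with plain Python objects standing in for the DataFrame cycle shows why breaking a cycle by hand frees memory immediately, instead of waiting for the cyclic collector:

```python
import gc
import weakref

class Node:
    pass

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # two objects referencing each other
    return weakref.ref(a), a

gc.disable()  # make collection timing deterministic for the demo
try:
    ref, a = make_cycle()
    del a
    # The cycle keeps both objects alive; only the cyclic GC can free them.
    assert ref() is not None
    gc.collect()
    assert ref() is None

    ref, a = make_cycle()
    a.other.other = None  # break the cycle by hand, as this PR does for DataFrames
    del a
    # With the cycle broken, plain refcounting frees the objects at once.
    assert ref() is None
finally:
    gc.enable()
```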
PERF: Use searchsorted instead of get_loc.
On pandas < 18, `get_loc` triggers allocation of a large hash table, so
we don't want to call get_loc on minutely `DatetimeIndex`es.
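A sketch of the substitution (the index below is made up for illustration): on a sorted `DatetimeIndex`, `searchsorted` is a binary search over the underlying int64 timestamps, so it never needs the hash table that `get_loc` builds on older pandas:

```python
import pandas as pd

# A day's worth of minutes, analogous to a trading calendar's minute index.
minutes = pd.date_range('2014-01-03 14:31', periods=390, freq='min', tz='UTC')
target = minutes[100]

# Binary search over the sorted timestamps; no hash table is allocated.
ix = minutes.searchsorted(target)
assert ix == 100
```

Note that unlike `get_loc`, `searchsorted` returns an insertion point rather than raising `KeyError` for a missing value, which is acceptable here because simulation minutes come from the calendar itself.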

@ssanderson ssanderson force-pushed the memory-savings branch from e02fa8b to 7a5f5da Nov 22, 2016

@coveralls

coveralls commented Nov 22, 2016

Coverage Status

Coverage decreased (-0.03%) to 87.071% when pulling 7a5f5da on memory-savings into 4174a09 on master.

----------
df : pd.DataFrame
"""
for attr in ('_loc', '_iloc'):

@ehebert

ehebert Nov 22, 2016

Member

How about iterating over the following, in case `ix`, `at`, or `iat` are ever used:

from operator import itemgetter

import pandas

get_name = itemgetter(0)
for attr in ('_' + name for name in map(get_name, pandas.core.indexing.get_indexers_list())):
    ...
@@ -380,7 +380,7 @@ def _ensure_sliding_windows(self, assets, dts, field,
         assets = self._asset_finder.retrieve_all(assets)
         try:
-            end_ix = self._calendar.get_loc(end)
+            end_ix = self._calendar.searchsorted(end)

@ehebert

ehebert Nov 22, 2016

Member

Thanks for spotting this one. I was unaware of that bug/behavior.

On the same track, I do question whether we should pull in all minutes ever from the calendar (mea culpa), in favor of calculating the location in a similar fashion to the minute bar reader's `find_position_of_minute`.
(The minute index takes 11MB at minimum for 14 years, just for the datetime values.)

Does using `__getitem__` also trigger the creation of the hash table?
If so, we should also audit usage of `trading_calendar.previous_minute`.
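The 11MB figure checks out as back-of-the-envelope arithmetic (assuming ~252 trading days per year and 390 minutes per session; both figures are assumptions, not from the thread):

```python
# Each minute is stored as one int64 timestamp (8 bytes).
years, days_per_year, minutes_per_day = 14, 252, 390
n_bytes = years * days_per_year * minutes_per_day * 8
assert n_bytes == 11_007_360  # ~10.5 MiB, i.e. roughly the quoted 11MB
```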

@ssanderson

ssanderson Nov 22, 2016

Member

`__getitem__` is just an array lookup.

I don't think we're spending an appreciable amount of time searching in these arrays (last I looked, the overwhelming bottleneck in this code is concatenating all the history sliding windows), so I'm not sure it's worth the refactoring cost to change this more. Even the `get_loc` here turned out not to be as bad as I expected: it's only about 50MB for the equity minutes. The noticeable memory regression from the pandas upgrade was in the futures calendar, which is packed much more densely.

@ssanderson

ssanderson Nov 22, 2016

Member

You can check whether the index has allocated a hash table by looking at `index._engine.mapping`:

In [1]: dts
Out[1]:
DatetimeIndex(['2014-01-03 14:31:00+00:00', '2014-01-03 14:32:00+00:00', '2014-01-03 14:33:00+00:00', '2014-01-03 14:34:00+00:00', '2014-01-03 14:35:00+00:00', '2014-01-03 14:36:00+00:00',
               '2014-01-03 14:37:00+00:00', '2014-01-03 14:38:00+00:00', '2014-01-03 14:39:00+00:00', '2014-01-03 14:40:00+00:00',
               ...
               '2014-01-04 20:51:00+00:00', '2014-01-04 20:52:00+00:00', '2014-01-04 20:53:00+00:00', '2014-01-04 20:54:00+00:00', '2014-01-04 20:55:00+00:00', '2014-01-04 20:56:00+00:00',
               '2014-01-04 20:57:00+00:00', '2014-01-04 20:58:00+00:00', '2014-01-04 20:59:00+00:00', '2014-01-04 21:00:00+00:00'],
              dtype='datetime64[ns, UTC]', length=1830, freq='T')

In [2]: dts._engine.mapping

In [3]: dts.get_loc('2014')
Out[3]: slice(0, 1830, None)

In [4]: dts._engine.mapping
Out[4]: <pandas.hashtable.Int64HashTable at 0x7fe9cfcbd790>
@@ -20,7 +20,7 @@ class Expired(Exception):
"""
class CachedObject(namedtuple("_CachedObject", "value expires")):

@ehebert

ehebert Nov 22, 2016

Member

Why the change from using `namedtuple`?

@ssanderson

ssanderson Nov 22, 2016

Member

@llllllllll and I had some concerns about the behavior of garbage collection on objects with empty `__slots__`. See the discussion in https://bugs.python.org/issue24379.

The specific issue, I think, is that objects with empty `__slots__` don't participate in garbage collection, which means they can leak if they participate in a cycle. I don't think that's happening here, but there's also not much benefit in making this a subclass of `namedtuple`.
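A plain class carrying the same two fields sidesteps the question entirely. A minimal sketch of what the replacement might look like (the `unwrap` method name and `Expired` semantics are inferred from the `class Expired(Exception)` context in the diff above, not copied from the actual change):

```python
class Expired(Exception):
    """Raised when a cached value is requested past its expiration."""

class CachedObject:
    """A cached value plus an expiration, as an ordinary GC-tracked class."""

    def __init__(self, value, expires):
        self.value = value
        self.expires = expires

    def unwrap(self, dt):
        """Return the cached value, raising Expired if dt is past expires."""
        if dt > self.expires:
            raise Expired(self.expires)
        return self.value
```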

@coveralls

coveralls commented Nov 22, 2016

Coverage Status

Coverage increased (+0.02%) to 87.119% when pulling 16e78bb on memory-savings into 4174a09 on master.

@coveralls

coveralls commented Nov 22, 2016

Coverage Status

Coverage increased (+0.02%) to 87.12% when pulling 9f28b7c on memory-savings into 4174a09 on master.

@ssanderson ssanderson merged commit 84e7c03 into master Nov 28, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the memory-savings branch Nov 28, 2016
