Speedup retrieve all #817

Merged
merged 18 commits into master from speedup-retrieve-all on Nov 16, 2015

Conversation

@ssanderson
Member

ssanderson commented Nov 5, 2015

Speed up AssetFinder.retrieve_all by doing one query instead of N.

  • Removes the separate caches for equities and futures in favor of a single asset cache.
  • We now cache the results of failed lookups. This required unfortunate hacks in the TradingAlgorithm code that resolves assets from DataFrames.
  • Reimplements single-asset lookups as a special case of multiple-asset lookups.
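The three bullets above can be sketched as a small, self-contained model of the new strategy: one shared cache, one batch query for all misses (including failures, which are cached as None), and single-asset lookup reimplemented on top of the batch path. All names here are illustrative, not the actual zipline API; a plain dict stands in for the real database query.

```python
def make_finder(db):
    """Build toy retrieve_all / retrieve_asset functions over a dict `db`."""
    cache = {}  # single cache for all asset types, including failed lookups

    def query_assets(sids):
        # One query for the whole batch instead of one query per sid.
        return {sid: db[sid] for sid in sids if sid in db}

    def retrieve_all(sids):
        misses = [sid for sid in sids if sid not in cache]
        if misses:
            found = query_assets(misses)
            for sid in misses:
                # Cache failures as None so we never re-query a bad sid.
                cache[sid] = found.get(sid)
        return [cache[sid] for sid in sids]

    def retrieve_asset(sid):
        # Single-asset lookup as a special case of the batch path.
        return retrieve_all([sid])[0]

    return retrieve_all, retrieve_asset
```

This also shows why failed-lookup caching forces the DataFrame-mapping hack mentioned above: once a sid is cached as None, later writes to the database are invisible unless the cache is cleared.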

@ssanderson ssanderson force-pushed the speedup-retrieve-all branch from 68cfde9 to c98d959 Nov 5, 2015

@ssanderson

Member

ssanderson commented Nov 5, 2015

Timings for ~30000 assets on my machine:

Before:

# Empty Cache
In [3]: %timeit -r1 -n1 f.retrieve_all(sids)
1 loops, best of 1: 29.4 s per loop

# Full Cache
In [5]: %timeit f.retrieve_all(sids)
100 loops, best of 3: 9.13 ms per loop

# Single Asset
In [4]: %timeit f.retrieve_asset(24)
The slowest run took 42.91 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 278 ns per loop

After:

# Empty Cache
In [5]: %timeit -n5 f.retrieve_all(sids); f._reset_caches()
5 loops, best of 3: 1.34 s per loop

# Full Cache
In [7]: %timeit f.retrieve_all(sids)
100 loops, best of 3: 5.96 ms per loop

# Single Asset
In [3]: %timeit f.retrieve_asset(24)
The slowest run took 49.35 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 546 ns per loop

Conclusions:

  • This is roughly 30x faster for batch lookups on a large number of assets when caches are empty.
  • This is roughly 2x faster for batch lookups when caches are full.
  • This is roughly 2x slower for single-asset lookups. Those lookups go from 270 ns to 540 ns on my machine. I think this is okay because that performance should only matter appreciably if we're looking up lots of assets in a loop, in which case we should just use the faster batch mode anyway. We can add more special-case handling for single-asset cache hits if we need to though.

cc @jbredeche on perf notes.

@@ -604,8 +605,7 @@ def _write_and_map_id_index_to_sids(self, identifiers, as_of_date):
         if isinstance(identifier, Asset):
             asset = self.asset_finder.retrieve_asset(sid=identifier.sid,
                                                      default_none=True)
-        elif hasattr(identifier, '__int__'):
+        elif isinstance(identifier, Integral):

@richafrank

richafrank Nov 5, 2015

Member

What's the impact of this change?

@ssanderson

ssanderson Nov 6, 2015

Member

We use Integral to define AssetConvertible in assets.py, so this seems more in line with the definition there.

One notable difference is that this will reject floats as inputs, which seems more correct to me. If you pass 3.5 as your index, the existing behavior would truncate to 3 and then convert it to an asset, which seems like unfortunate behavior.
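The float-truncation difference described above is easy to demonstrate: floats define `__int__`, so the old check accepts them and `int()` silently truncates, while `numbers.Integral` rejects them. This is a standalone illustration, not code from the PR.

```python
from numbers import Integral


def old_check(x):
    # The previous test: anything coercible to int passes, including floats.
    return hasattr(x, '__int__')


def new_check(x):
    # The replacement test: only true integer types pass.
    return isinstance(x, Integral)
```

So under the old check, an index value of 3.5 would be truncated to sid 3 and resolved to an asset; under the new check it is rejected outright.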

@richafrank

richafrank Nov 6, 2015

Member

That makes sense to me.

                (equities_cols.share_class_symbol == share_class_symbol)
            ).execute().fetchall()
            sids = list(imap(
                itemgetter('sid'),

@richafrank

richafrank Nov 5, 2015

Member

Given that this is all about performance, any idea how this compares to a list comprehension?

sids = [row['sid'] for row in sa.select(...).execute().fetchall()]

(Same question below)
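For small result sets the two forms are equivalent in output and differ only marginally in speed; a toy comparison, using plain dicts in place of real SQLAlchemy result rows:

```python
from operator import itemgetter

# Stand-ins for database result rows.
rows = [{'sid': 24}, {'sid': 8554}]

# imap/itemgetter form (Python 3 map is already lazy, like Python 2 imap).
sids_functional = list(map(itemgetter('sid'), rows))

# List-comprehension form.
sids_comprehension = [row['sid'] for row in rows]
```

Both produce the same list; for a handful of rows the choice is stylistic rather than a performance decision.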

@ssanderson

ssanderson Nov 6, 2015

Member

@richafrank I'd expect the difference here to be pretty negligible since this is only loading ~5-10 assets at most. I didn't touch much of the by-symbol lookup logic, but there are definitely a lot of improvements that could be made there.

@richafrank

richafrank Nov 6, 2015

Member

Oh interesting. Why is this case at most 5-10 assets?

@ssanderson

ssanderson Nov 6, 2015

Member

This is in lookup_symbol. We're only resolving sids for assets that share a symbol. I think the most-shared symbol in our upstream database has something like 12 assets with that symbol?

@richafrank

richafrank Nov 6, 2015

Member

Ah, got it. Thanks!

@richafrank

Member

richafrank commented Nov 5, 2015

Nice conclusions! Is the single-asset lookup case for cache-miss or cache-hit?

@ssanderson

Member

ssanderson commented Nov 6, 2015

@richafrank the single-asset case is about the same 2x slowdown for both hits and misses. The perf is closer on misses though, since they're both making queries of similar cost.

@ssanderson ssanderson force-pushed the speedup-retrieve-all branch 3 times, most recently from 71a76f2 to 9881c30 Nov 6, 2015

@ssanderson ssanderson assigned mtydykov and unassigned llllllllll Nov 13, 2015

@ssanderson

Member

ssanderson commented Nov 13, 2015

@mtydykov can you take a look at the commits from a436426 to 9881c30? The rest of this PR has already been reviewed by @richafrank, but I made a couple changes after rebasing against your latest changes.

@mtydykov

Contributor

mtydykov commented Nov 13, 2015

LGTM

ssanderson added some commits Oct 29, 2015

PERF: Cache the result of failed lookups.
Otherwise we'll keep trying to look them up.
BUG: Raise if we fail to symbol map a Frame/Panel.
Previously, we just warned here, but then we'd fail immediately trying
to write the found values into an index that's too long.
BUG: Clear asset caches when mapping DataFrame.
Our DataFrame index resolution logic relies on failed lookups **not**
being cached, but not caching failed lookups is a nontrivial performance
hit when repeatedly looking up sids.  The "solution" here is to clear
the caches after writing in new assets.

The real fix for this is either:

1. Don't construct an AssetFinder until we have the datasource in hand
   in run(), or
2. Don't symbol-map the user's input source if it's a DataFrame.
   Instead we should make our data loaders pre-map the data.
ENH: Make retrieve specific type functions public.
We rely on these upstream, for better or worse, so add tests and docs.

Also adds distinct `EquitiesNotFound` and `FutureContractsNotFound`
exceptions.
MAINT: Don't double-coerce to Timestamp.
`normalize` would already fail if the input isn't a Timestamp.

ssanderson added some commits Nov 12, 2015

MAINT: De-dupe shared filtering logic.
We filter for assets that were alive at some reference date in multiple
places.
TEST: Test fallback to start_date/end_date sorting.
Adds tests asserting that we resolve conflicts in accordance with the
following rules when we have multiple assets holding the same symbol at
the same time:

If multiple SIDs exist for symbol S at time T, return the candidate
SID whose start_date is highest. (200 cases)

If multiple SIDs exist for symbol S at time T and more than one
candidate shares the highest start_date, return the SID with the
highest end_date. (34 cases)

It is the opinion of the author (ssanderson) that we should consider
this malformed input and fail here.  But this is the current intended
behavior of the code, and I accidentally broke it while refactoring.
These will serve as regression tests until the time comes that we
decide to enforce this as an error.

See #837 for more details.
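The two tie-breaking rules in that commit message reduce to a single lexicographic max over (start_date, end_date). A minimal sketch, assuming candidates arrive as (sid, start_date, end_date) tuples already filtered to those alive at time T (this is an illustration of the rules, not the actual zipline implementation):

```python
def best_candidate(candidates):
    """Pick the winning sid among candidates sharing a symbol at time T.

    candidates: iterable of (sid, start_date, end_date) tuples.
    Prefers the highest start_date, then the highest end_date.
    """
    return max(candidates, key=lambda c: (c[1], c[2]))[0]
```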

@ssanderson ssanderson force-pushed the speedup-retrieve-all branch from 9881c30 to 4832004 Nov 13, 2015

ssanderson added a commit that referenced this pull request Nov 16, 2015

@ssanderson ssanderson merged commit fed6655 into master Nov 16, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@ssanderson ssanderson deleted the speedup-retrieve-all branch Nov 16, 2015
