PERF: Reduces memory footprint of Quandl WIKI Prices bundle #2053

Merged
ernestoeperez88 merged 1 commit into master from quandl_bundle_perf_enh on Dec 20, 2017

3 participants
@ernestoeperez88
Member

ernestoeperez88 commented Dec 14, 2017

Refactored some of the helper functions in the Quandl bundle to be more memory efficient. Memory now peaks at ~2500 MiB and otherwise holds at roughly 2000 MiB, as the line-by-line profile below shows.

Filename: /Users/ernestoeperez88/zipline_master/zipline/data/bundles/quandl.py

Line #    Mem usage    Increment   Line Contents
================================================
   186    167.3 MiB    167.3 MiB   @bundles.register('quandl')
   187                             @profile
   188                             def quandl_bundle(environ,
   189                                               asset_db_writer,
   190                                               minute_bar_writer,
   191                                               daily_bar_writer,
   192                                               adjustment_writer,
   193                                               calendar,
   194                                               start_session,
   195                                               end_session,
   196                                               cache,
   197                                               show_progress,
   198                                               output_dir):
   199                                 """
   200                                 quandl_bundle builds a data bundle using Quandl's WIKI Prices dataset.
   201                             
   202                                 For more information on Quandl's API and how to obtain an API key,
   203                                 please visit https://docs.quandl.com/docs#section-authentication
   204                                 """
   205    167.3 MiB      0.0 MiB       raw_data = fetch_data_table(
   206    167.3 MiB      0.0 MiB           environ.get('QUANDL_API_KEY'),
   207    167.3 MiB      0.0 MiB           show_progress,
   208   1666.3 MiB   1499.0 MiB           environ.get('QUANDL_DOWNLOAD_ATTEMPTS', 5)
   209                                 )
   210   1666.3 MiB      0.0 MiB       asset_metadata = gen_asset_metadata(
   211   1666.4 MiB      0.1 MiB           raw_data[['symbol', 'date']],
   212   2430.2 MiB    763.8 MiB           show_progress
   213                                 )
   214   2202.9 MiB   -227.3 MiB       asset_db_writer.write(asset_metadata)
   215                             
   216   2202.9 MiB      0.0 MiB       symbol_map = asset_metadata.symbol
   217   2202.9 MiB      0.0 MiB       sessions = calendar.sessions_in_range(start_session, end_session)
   218                             
   219   2261.1 MiB     58.2 MiB       raw_data.set_index(['date', 'symbol'], inplace=True)
   220   2261.1 MiB      0.0 MiB       daily_bar_writer.write(
   221   2261.1 MiB      0.0 MiB           parse_pricing_and_vol(
   222   2261.1 MiB      0.0 MiB               raw_data,
   223   2261.1 MiB      0.0 MiB               sessions,
   224   2261.1 MiB      0.0 MiB               symbol_map
   225                                     ),
   226   1426.1 MiB   -835.0 MiB           show_progress=show_progress
   227                                 )
   228                             
   229   1904.5 MiB    478.4 MiB       raw_data.reset_index(inplace=True)
   230   1942.8 MiB     38.3 MiB       raw_data['symbol'] = raw_data['symbol'].astype('category')
   231   1942.8 MiB      0.0 MiB       raw_data['sid'] = raw_data.symbol.cat.codes
   232   1942.8 MiB      0.0 MiB       adjustment_writer.write(
   233   1942.8 MiB      0.0 MiB           splits=parse_splits(
   234   1942.8 MiB      0.0 MiB               pd.DataFrame(
   235   1942.8 MiB      0.0 MiB                   raw_data[[
   236   1942.8 MiB      0.0 MiB                       'sid',
   237   1942.8 MiB      0.0 MiB                       'date',
   238   1942.8 MiB      0.0 MiB                       'split_ratio',
   239   1942.8 MiB      0.0 MiB                   ]].loc[raw_data.split_ratio != 1]
   240                                         ),
   241   1942.8 MiB      0.0 MiB               show_progress=show_progress
   242                                     ),
   243   1942.8 MiB      0.0 MiB           dividends=parse_dividends(
   244   1942.8 MiB      0.0 MiB               pd.DataFrame(
   245   1942.8 MiB      0.0 MiB                   raw_data[[
   246   1942.8 MiB      0.0 MiB                       'sid',
   247   1942.8 MiB      0.0 MiB                       'date',
   248   1854.0 MiB    -88.8 MiB                       'ex_dividend',
   249   1872.4 MiB     18.4 MiB                   ]].loc[raw_data.ex_dividend != 0]
   250                                         ),
   251   1506.9 MiB   -365.5 MiB               show_progress=show_progress
   252                                     )
   253                                 )
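
The profiled lines 230-231 convert `symbol` to a categorical column and reuse the category codes as `sid`. A minimal sketch of that step on made-up rows, assuming only pandas (the sample frame and the variable names `string_bytes` and `category_bytes` are illustrative, not part of the bundle):

```python
import pandas as pd

# Made-up sample rows; the real WIKI Prices table is far larger.
raw_data = pd.DataFrame({
    'symbol': ['MSFT', 'AAPL', 'MSFT', 'AAPL', 'IBM'] * 1000,
    'close': [1.0, 2.0, 3.0, 4.0, 5.0] * 1000,
})

# Size of the column while it still stores one string per row.
string_bytes = raw_data['symbol'].memory_usage(deep=True)

# A categorical column stores each distinct symbol once, plus small
# integer codes per row.
raw_data['symbol'] = raw_data['symbol'].astype('category')
category_bytes = raw_data['symbol'].memory_usage(deep=True)

# The integer codes can then serve as compact identifiers.
raw_data['sid'] = raw_data['symbol'].cat.codes

print(string_bytes, category_bytes)
print(raw_data.head())
```

With only a few distinct symbols repeated many times, the categorical column should come out noticeably smaller than the string column; the exact byte counts depend on the pandas version.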

@ernestoeperez88 ernestoeperez88 requested a review from freddiev4 Dec 14, 2017

@coveralls

coveralls commented Dec 19, 2017

Coverage increased (+0.002%) to 87.547% when pulling a4cf026 on quandl_bundle_perf_enh into fdfce9b on master.

if show_progress:
    log.info('Parsing raw data.')
data_table = pd.read_csv(
    table_file,

@prsutherland

prsutherland Dec 19, 2017

Member

Is table_file readable outside of the with block above? I'd also recommend using zip_file.open as a context manager: https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.open

@ernestoeperez88

ernestoeperez88 Dec 20, 2017

Member

table_file is readable outside the with block; zip_file is the one that is no longer accessible.

I'll write it as shown in the link/docs to keep things consistent and readable.
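
For reference, the nested context-manager form suggested above might look roughly like this. This is a sketch, not the code that was merged: the helper name `load_rows_from_zip` is hypothetical, the archive contents are made up, and the standard-library `csv.DictReader` stands in for `pd.read_csv` so the example needs no third-party packages.

```python
import csv
import io
import zipfile


def load_rows_from_zip(zip_source):
    """Read the single CSV member of a zip archive into a list of dicts."""
    with zipfile.ZipFile(zip_source) as zip_file:
        file_names = zip_file.namelist()
        assert len(file_names) == 1, "Expected a single file in the archive."
        # ZipFile.open returns a file-like object that also works as a
        # context manager, so the member is closed as soon as parsing ends.
        with zip_file.open(file_names[0]) as table_file:
            text = io.TextIOWrapper(table_file, encoding='utf-8', newline='')
            # Parse everything while both the member and the archive are open.
            return list(csv.DictReader(text))


# Build a tiny in-memory archive with made-up values to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('prices.csv', 'symbol,date,close\nAAPL,2017-12-14,1.5\n')
buffer.seek(0)

rows = load_rows_from_zip(buffer)
print(rows)
```

Keeping the parse inside both with blocks avoids having to reason about whether the member file is still usable once the archive has been closed.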

adjustment_writer.write(
    splits=parse_splits(
        raw_data[['symbol', 'date', 'split_ratio']],
        pd.DataFrame(

@prsutherland

prsutherland Dec 19, 2017

Member

I think the inner loc should already return a DataFrame in the calls to parse_splits and parse_dividends below.
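
A quick way to check this on a toy frame, assuming only pandas (the sample values and the names `split_rows` and `wrapped` are made up for illustration):

```python
import pandas as pd

# Made-up rows with the same column names the bundle uses.
raw_data = pd.DataFrame({
    'sid': [0, 0, 1, 1],
    'date': pd.to_datetime(
        ['2017-12-14', '2017-12-15', '2017-12-14', '2017-12-15']
    ),
    'split_ratio': [1.0, 2.0, 1.0, 1.0],
    'ex_dividend': [0.0, 0.0, 0.5, 0.0],
})

# Selecting a list of columns and filtering rows with a boolean mask
# via .loc yields a DataFrame.
split_rows = raw_data[['sid', 'date', 'split_ratio']].loc[
    raw_data.split_ratio != 1
]

# Wrapping the result in pd.DataFrame gives a frame with the same contents.
wrapped = pd.DataFrame(split_rows)

print(type(split_rows).__name__)
print(wrapped.equals(split_rows))
```

If that holds for the real data as well, the extra pd.DataFrame(...) wrapper is harmless but adds nothing to the result.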

data.columns = data.columns.get_level_values(0)
data['exchange'] = 'QUANDL'
data['auto_close_date'] = \

@prsutherland

prsutherland Dec 19, 2017

Member

Minor thing, but I don't think the line needs to be wrapped anymore.

@ernestoeperez88

ernestoeperez88 Dec 20, 2017

Member

Oh, you are right. I renamed the DataFrame here, which left just enough space to fit it on one line.

@ernestoeperez88 ernestoeperez88 force-pushed the quandl_bundle_perf_enh branch from a4cf026 to 7affefa Dec 20, 2017

@coveralls

coveralls commented Dec 20, 2017

Coverage increased (+0.002%) to 87.546% when pulling 7affefa on quandl_bundle_perf_enh into 52826bf on master.

@prsutherland

LGTM

@ernestoeperez88 ernestoeperez88 merged commit 4bfdbd9 into master Dec 20, 2017

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@ernestoeperez88 ernestoeperez88 deleted the quandl_bundle_perf_enh branch Dec 20, 2017
