New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/MAINT: Switch over to Google for benchmarking #1812

Merged
merged 4 commits into from Jun 5, 2017

Conversation

Projects
None yet
5 participants
@freddiev4
Contributor

freddiev4 commented May 24, 2017

Due to recent changes in the Yahoo API, we're switching to Google Finance for benchmarking data to SPY. The endpoint we're hitting is also undocumented so this will serve as just a fix until we can come up with a more robust solution.

We're using pandas-datareader to pull the data from Google Finance.

The one potential issue here is that the history that GF provides only goes back to around June 2001, and also data from 2009-08-11 and 2012-02-02 is missing. This breaks zipline backtests that need older benchmark data. One thought is to ffill for those dates but I'm not quite sure if that's an appropriate change.

x-ref #1805 and #1776

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch 8 times, most recently from 42a6cb9 to 7f484aa May 24, 2017

@freddiev4

This comment has been minimized.

Contributor

freddiev4 commented May 26, 2017

Tests currently failing and looking like this:

image

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch 4 times, most recently from 8b4e616 to 82c4ce1 May 26, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 1, 2017

Coverage Status

Coverage decreased (-0.02%) to 87.563% when pulling 82c4ce1 on google-finance-for-benchmarking into 9fe8076 on master.

@freddiev4 freddiev4 requested a review from richafrank Jun 1, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 1, 2017

Coverage Status

Coverage decreased (-0.02%) to 87.563% when pulling 47ef512 on google-finance-for-benchmarking into 9fe8076 on master.

@richafrank

Thanks @freddiev4 ! Had some questions for you.

@@ -14,6 +15,8 @@ from zipline.testing import test_resource_path, tmp_dir
from zipline.utils.cache import dataframe_cache
matplotlib.use('Agg')

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Is it new that matplotlib is required to rebuild the example data?

This comment has been minimized.

@freddiev4

freddiev4 Jun 1, 2017

Contributor

It's required to rebuild the example data only because the example algos call an analyze function, which plots graphs of performance results; for some reason (most likely because of our recent update to the matplotlib version) a matplotlib window will open up and the rebuild_example_data script waits on the user to manually close that window (thus killing that process) before it continues.

By explicitly stating the Agg backend, this issue doesn't occur and no matplotlib windows show up when running the script.

This comment has been minimized.

@richafrank
@@ -385,7 +385,7 @@ def test_benchmarkrange(self):
def test_partial_month(self):
start_session = self.trading_calendar.minute_to_session_label(
pd.Timestamp("1991-01-01", tz='UTC')
pd.Timestamp("1993-02-01", tz='UTC')

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Why this change?

This comment has been minimized.

@freddiev4

freddiev4 Jun 1, 2017

Contributor

I made this change because the furthest date for which we have data from Google Finance is 1993-02-01, so this test fails because it's looking for data at a start date before the oldest date.

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Oh, I thought it went back only to 2001?

The one potential issue here is that the history that GF provides only goes back to around June 2001, and also data from 2009-08-11 and 2012-02-02 is missing.

This comment has been minimized.

@freddiev4

freddiev4 Jun 1, 2017

Contributor

It does not. I have to change that comment as I had misunderstood how google provides data. The one important restriction is that we only can get the last 4000 days of data at a time from google.

I stitched multiple csvs of data all the way back to 1993, the furthest possible date google provides data for.

@@ -54,10 +53,7 @@ def init_class_fixtures(cls):
serialization='pickle',
)
copy_market_data(WithTradingEnvironment.MARKET_DATA_DIR,
cls.tmpdir.getpath('example_data/root'))

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Does this mean the tests will query google for the data instead of using the data saved in the repo?

logger.info('Downloading benchmark data for {symbol!r}.', symbol=symbol)
logger.info(
('Downloading benchmark data for {symbol!r} '
'from {first_date} to {last_date}'),

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

👍 for the improved message

@@ -260,7 +269,7 @@ def ensure_treasury_data(symbol, first_date, last_date, now, environ=None):
path.
"""
loader_module, filename, source = INDEX_MAPPING.get(
symbol, INDEX_MAPPING['^GSPC'],
symbol, INDEX_MAPPING['SPY'],

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Not a blocker, but I wonder if we should move this default benchmark symbol out into a global somewhere.

"&layout=seriescolumn"
"&type=package",
skiprows=1, # First row is a useless header.
skiprows=5, # First 5 rows are useless headers.

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Why are we changing the treasuries download? Is this related to the yahoo/google change?

This comment has been minimized.

@freddiev4

freddiev4 Jun 1, 2017

Contributor

This is just a problem I ran into and noticed when I wanted to be sure that the benchmarking works (by running a small algo I made locally). When I deleted the cached treasury_curves.csv from ~/.zipline, I ran into the 'Time Period' not in list error when trying to download new treasury curves data.

I downloaded a csv of the treasury data manually to check it and it has 5 useless headers at the beginning of the file. Removing those headers fixed the download issue.

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Aha, gotcha. And what's the deal with the label?

@@ -78,7 +78,7 @@ class TradingEnvironment(object):
def __init__(
self,
load=None,
bm_symbol='^GSPC',
bm_symbol='SPY',

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

Should we delete the old benchmark csv from the repo?

This comment has been minimized.

@freddiev4

freddiev4 Jun 1, 2017

Contributor

Yes. We should do that. We should also remove data in the example_data tarball for older versions of pandas.

This comment has been minimized.

@yiorgosn

yiorgosn Jun 6, 2017

If i may make one final recommendation on all the benchmark data as you program this. As a 20+ year hedge fund professional I can attest that the benchmark is good to have and good business practice to include it in your studies and presentations, so do NOT remove it like some have suggested. However to save valuable programming time and provide a good and accepted solution when it comes to missing data chunks (like we currently have with the time series from Google) filling in the partially missing data with the backfill function would be completely acceptable, so long as you alert the user and note it somewhere. This can be a third choice: if a file is only partially available online.
So a potential decision tree here regarding benchmarks could be:
A. See if available online,
B. see if the user can upload a csv,
C. if available online with NaN backfill,
D. if not available online but there is an old copy grab an old benchmark file and backfill the missing data for the time range you need and alert the user,
E. Last and Catch All Option, produced a benchmark file with a fixed rate of return set by the user, typically the risk free rate, if one is not available, default it to 5% annualized and alert the user of course.
Benchmarks should always be present but they don't always have to be exact and should NOT hold up the algorithm like they have now. Also, if you can, don't call it the S&P benchmark for file names and inside the code, call it the equity benchmark as you may choose to change it later on. I am new to github and python programming but you guys have done a stellar job here. Thank you!

cls.tmpdir.getpath('example_data/root'))
@parameterized.expand(examples.EXAMPLE_MODULES)
@parameterized.expand(sorted(examples.EXAMPLE_MODULES))

This comment has been minimized.

@richafrank

richafrank Jun 1, 2017

Member

👍 for consistent test order/names

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from 5d9555e to 7e3cc3d Jun 2, 2017

freddiev4 added some commits May 22, 2017

BUG/MAINT: Switch over to Google for benchmarking
MAINT: Remove mentions of Yahoo & ^GSPC

MAINT: Fill in missing dates

MAINT/BLD: Rebuild example data to match new benchmark

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from 7e3cc3d to 9984d84 Jun 2, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage decreased (-0.02%) to 87.565% when pulling 9984d84 on google-finance-for-benchmarking into 9fe8076 on master.

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from 9984d84 to b73c7de Jun 2, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage increased (+0.09%) to 87.671% when pulling b73c7de on google-finance-for-benchmarking into 9fe8076 on master.

@freddiev4

This comment has been minimized.

Contributor

freddiev4 commented Jun 2, 2017

@richafrank Added the ensure_file() call and removed the extra data we had saved

@richafrank

Thanks @freddiev4 . I only had one more issue.

@@ -50,6 +51,7 @@ def last_modified_time(path):
"""
Get the last modified time of path as a Timestamp.
"""
ensure_file(path)

This comment has been minimized.

@richafrank

richafrank Jun 2, 2017

Member

Oh, I think we don't want to do this, since it basically subverts the check below in production code. This function no longer just returns the last modified time, it first makes it "now". I think where we were copying over the market data previously (for tests only), we want to replace that with touching the market data.

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from b73c7de to 9d1eec6 Jun 2, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage increased (+0.08%) to 87.669% when pulling 9d1eec6 on google-finance-for-benchmarking into 9fe8076 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage increased (+0.08%) to 87.669% when pulling 54fc262 on google-finance-for-benchmarking into 9fe8076 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage decreased (-0.3%) to 87.304% when pulling ea03425 on google-finance-for-benchmarking into 9fe8076 on master.

@coveralls

This comment has been minimized.

coveralls commented Jun 2, 2017

Coverage Status

Coverage increased (+0.08%) to 87.669% when pulling ea03425 on google-finance-for-benchmarking into 9fe8076 on master.

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch 2 times, most recently from 4486860 to c3d49ef Jun 2, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 3, 2017

Coverage Status

Coverage increased (+0.08%) to 87.75% when pulling c3d49ef on google-finance-for-benchmarking into 3081386 on master.

cls.tmpdir.getpath('example_data/root'))
market_data = ('SPY_benchmark.csv', 'treasury_curves.csv')
for file in market_data:
ensure_file(cls.tmpdir.getpath('example_data/root/data/') + file)

This comment has been minimized.

@richafrank

richafrank Jun 5, 2017

Member

Just some nits now: Let's not name this variable file, so it doesn't mask the builtin. And let's use os.path.join instead of + to ensure the path separator is correct on windows and elsewhere - in fact, if we move + file into the getpath call, it looks like it uses os.path.join for us under the hood.

Otherwise LGTM!

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from c3d49ef to ccadf8f Jun 5, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 5, 2017

Coverage Status

Coverage decreased (-0.02%) to 87.644% when pulling ccadf8f on google-finance-for-benchmarking into 3081386 on master.

@freddiev4

This comment has been minimized.

Contributor

freddiev4 commented Jun 5, 2017

@richafrank fixed up stuff on your most recent comment!

MAINT: Skip more rows to match change in treasury data format
I'm not sure what the raw csv pulled from the federal reserve looked like before, but when trying to download fresh treasure data (data not stored in `./zipline`), there is an error that says "Time Period not in list". After checking the raw csv now, it looks like there are 5 header rows rather than just 1, so skipping those rows removes that error.

@freddiev4 freddiev4 force-pushed the google-finance-for-benchmarking branch from ccadf8f to a12c34c Jun 5, 2017

@coveralls

This comment has been minimized.

coveralls commented Jun 5, 2017

Coverage Status

Coverage increased (+0.08%) to 87.75% when pulling a12c34c on google-finance-for-benchmarking into 3081386 on master.

@freddiev4 freddiev4 merged commit cabe0b6 into master Jun 5, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@freddiev4 freddiev4 deleted the google-finance-for-benchmarking branch Jun 5, 2017

@freddiev4 freddiev4 referenced this pull request Jun 5, 2017

Closed

Benchmark troubles #271

@yiorgosn

This comment has been minimized.

yiorgosn commented Jun 6, 2017

Another benchmark error that has to do again with missing index data for the user specified date range: Apparently, very recent benchmark data is also missing with the current google solution such as yesterday's prices. Trying to simulate a portfolio on June 6, 2017 with the Quantopian-Quandl bundle that i ingested and verified that it includes closing prices for June 5th, 2017. The simulation can NOT be run because the benchmark loader throws an error due to the benchmark data file not including June 5th, 2017. The benchmark SPY or specifically the ^GSPC csv file actually gets written or modified today, June 6 2017, but it only includes benchmark data up until June 2nd 2017 (<-3999 entry). I checked manually on the google finance website and the data for June 5, 2017 is available. If i do simulations not including yesterday's date for periods when i know the benchmark data is available from 2013 - 6-2-2017 all the simulation runs fine. As I mentioned in my earlier post if the benchmark modules can just fill in missing data this should not be an issue.

Here is the error:
Traceback (most recent call last):
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1296, in _has_valid_type
error()
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1291, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [2017-06-05 00:00:00+00:00] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\George\Anaconda3\envs\py34\Scripts\zipline-script.py", line 11, in
load_entry_point('zipline==1.0.2', 'console_scripts', 'zipline')()
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\core.py", line 722, in call
return self.main(*args, **kwargs)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\core.py", line 697, in main
rv = self.invoke(ctx)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\core.py", line 1066, in invoke
return process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\core.py", line 535, in invoke
return callback(*args, **kwargs)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline_main
.py", line 97, in _
return f(*args, **kwargs)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline_main_.py", line 240, in run
environ=os.environ,
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline\utils\run_algo.py", line 180, in _run
overwrite_sim_params=False,
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline\algorithm.py", line 688, in run
for perf in self.get_generator():
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline\gens\tradesimulation.py", line 228, in transform
handle_benchmark(normalize_date(dt))
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline\gens\tradesimulation.py", line 188, in handle_benchmark
benchmark_source.get_value(date)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\zipline\sources\benchmark_source.py", line 73, in get_value
return self._precalculated_series.loc[dt]
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1198, in getitem
return self._getitem_axis(key, axis=0)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1342, in _getitem_axis
self._has_valid_type(key, axis)
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1304, in _has_valid_type
error()
File "C:\Users\George\Anaconda3\envs\py34\lib\site-packages\pandas\core\indexing.py", line 1291, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [2017-06-05 00:00:00+00:00] is not in the [index]'

@freddiev4

This comment has been minimized.

Contributor

freddiev4 commented Jun 6, 2017

Hey @yiorgosn thanks for posting about this. I also noticed this issue earlier today and am working on a fix 😄

@yiorgosn

This comment has been minimized.

yiorgosn commented Jun 8, 2017

@freddiev4 Thank you! Changes made to fix the benchmark missing data and code error issues worked.
Files I updated\downloaded to make exisning zipline 1.1.0 work:
\data\benchmarks.py
\data\loader.py
\finance\trading.py
\utils\date_utils.py

WARNING: Right now June 8, 2017 if one does a channel installation with: conda install -c Quantopian zipline=1.1.0 the following files crash the code:
\source\benchmark_source.py file dated 6/8/2017 (when using the old files 3/11/17 it works)
\source\requests_csv.py file dated 6/8/2017 (when using the old files 3/11/17) it works)

Finally, i discovered if one makes the following modification it eliminates the benchmark date range error, if it is not the same as the user date range, and it no longer holds up the simulation.
Modify: ./zipline/gens/tradesimulation.py
and set:
def handle_benchmark(date, benchmark_source=self.benchmark_source):
algo.perf_tracker.all_benchmark_returns[date] = 1
this way it doesn't check if there is data, and just always returns 1 for a benchmark

@saitam1

This comment has been minimized.

saitam1 commented Jul 1, 2017

I am having this error when running the example buyapple.py

request.py line 1320, in do_open raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>

Does it have to do with the Yahoo Finance API change?

@freddiev4

This comment has been minimized.

Contributor

freddiev4 commented Jul 3, 2017

@saitam1 what is the command you ran for the buyapple.py example algo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment