Treasury cleanup #793

Merged
merged 14 commits into from Oct 25, 2015

2 participants
@ssanderson
Member

ssanderson commented Oct 22, 2015

Rewrites most of load_market_data and the functions it calls.

The bulk of the work here is in rewriting the treasury loader and benchmark loader to use pd.read_csv with standard arguments instead of our own homegrown parser. See the message for 5381f14 for a detailed analysis of the treasury loader change.

The benchmark returns loader has some more significant deviations from the previous behavior, but I don't see anything wrong with the code. @ehebert do you know of a good place to find reference values for S&P500 returns to check my current results against the previously-generated results?

@ehebert

Member

ehebert commented Oct 23, 2015

One idea for double checking against old calculations would be to look at the output/cache in ~/.zipline/data using the different versions.

Loosely:

  • Checkout master.
  • rm ~/.zipline/data/\^GSPC_benchmark.csv
  • Run unit tests.
  • cp ~/.zipline/data/\^GSPC_benchmark.csv ~/.zipline/data/\^GSPC_benchmark.csv.bak
  • rm ~/.zipline/data/\^GSPC_benchmark.csv
  • Checkout new branch.
  • Compare the two files.
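
As a rough sketch of the final comparison step: assuming each cached file is a headerless two-column CSV of date and return (an assumption about the cache layout, not something stated in this thread), the two versions could be diffed with pandas. The helper names and the inline sample data are made up for illustration; in practice `source` would be the `.bak` file and the freshly generated file.

```python
import io

import pandas as pd


def load_cached_series(source):
    """Read a headerless (date, value) CSV into a Series indexed by date.

    The two-column layout is an assumption made for this sketch.
    """
    frame = pd.read_csv(
        source,
        header=None,
        names=["date", "value"],
        parse_dates=["date"],
        index_col="date",
    )
    return frame["value"]


def compare_series(old, new, tol=0.0):
    """Return (values differing by more than tol, dates present in only one)."""
    old_aligned, new_aligned = old.align(new, join="outer")
    # A date is "missing" when exactly one side has no value for it.
    missing = old_aligned.isna() ^ new_aligned.isna()
    diff = (old_aligned - new_aligned).abs()
    # NaN compares False, so dates missing from one side drop out here.
    return diff[diff > tol], missing[missing].index


# Made-up sample data standing in for the two cache files.
OLD_CSV = "2015-10-19,0.001\n2015-10-20,-0.002\n2015-10-21,0.003\n"
NEW_CSV = "2015-10-19,0.001\n2015-10-20,-0.004\n2015-10-22,0.005\n"

old_returns = load_cached_series(io.StringIO(OLD_CSV))
new_returns = load_cached_series(io.StringIO(NEW_CSV))
changed, only_in_one = compare_series(old_returns, new_returns)
print(changed)
print(only_in_one)
```

Reporting the dates that appear on only one side separately keeps a missing day from being confused with a changed value.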
@ehebert

Member

ehebert commented Oct 23, 2015

Overall LGTM. Mostly just a few style questions, plus one real question: should we convert to US/Eastern and then back to UTC before applying the trading day offset?

@ssanderson

Member

ssanderson commented Oct 23, 2015

> Checkout master.
> rm ~/.zipline/data/^GSPC_benchmark.csv
> Run unit tests.
> cp ~/.zipline/data/^GSPC_benchmark.csv ~/.zipline/data/^GSPC_benchmark.csv.bak
> rm ~/.zipline/data/^GSPC_benchmark.csv
> Checkout new branch.
> Compare the two files.

I've already done that in both cases. For the treasuries there's exactly one case where there's a potentially significant (on the order of 0.002) difference for a single date on the 3month curve, which is due to the fact that data.treasury.gov and federalreserve.gov produce different results there. I'm not sure how to verify that one or the other is correct.

For the returns benchmarks there are more nontrivial differences, which is why I was looking for a canonical place to compare our current return calculations vs the old ones.

ssanderson added some commits Oct 22, 2015

MAINT: Remove default values from dump_treasury_curves.
We never call the function without passing them explicitly.
MAINT: Just do searchsorted with the date.
Previously we were converting our date to a string, then calling
`searchsorted` on the DatetimeIndex with the string, which would cause
pandas to convert the string back into a date to actually do the lookup.
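
A minimal illustration of the direct lookup, using a made-up index rather than zipline's actual trading calendar (the business-day range here is only a stand-in):

```python
import pandas as pd

# Stand-in for a sorted index of trading days (weekdays only, UTC).
days = pd.bdate_range("2015-10-19", "2015-10-30", tz="UTC")

# Pass the Timestamp itself; no round trip through a string is needed.
wednesday = pd.Timestamp("2015-10-21", tz="UTC")
pos_exact = days.searchsorted(wednesday)

# For a date that is not in the index (a Saturday), searchsorted returns
# the insertion point, i.e. the position of the next trading day.
saturday = pd.Timestamp("2015-10-24", tz="UTC")
pos_next = days.searchsorted(saturday)

print(pos_exact, days[pos_exact])
print(pos_next, days[pos_next])
```
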
ENH: Rewrite treasury loader using pandas.
Replaces our custom XML parsing with a single call to `pd.read_csv`
against the Federal Reserve's API.  This produces results nearly
identical to those of the old loader, but it's dramatically simpler and
roughly 10x faster on my machine.

The average difference in magnitude between new and old is on the order
of 1e-7 or smaller for every column, and only one entry is different to
a degree greater than the number of significant figures provided by
treasury.gov.

Additionally, the new loader correctly ignores Columbus Day of 2010, for
which the old loader erroneously produced an all-NaN row.

This also changes the interface that treasury modules are
required to implement. Modules must now supply a `get_treasury_data`
function that returns a `DataFrame` with a daily `DatetimeIndex` and a
column for each supported treasury duration.
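
A minimal sketch of a function with that return shape, built on `pd.read_csv`. Everything specific here is an assumption for illustration: the `source` argument, the inline sample data, the `ND` marker for missing values, and the division by 100 to turn percentages into decimals are not taken from the actual loader.

```python
import io

import pandas as pd

# Made-up sample in a layout similar to a rates CSV export.
SAMPLE_CSV = """Time Period,1month,3month,6month
2015-10-19,0.01,0.02,0.12
2015-10-20,ND,0.01,0.11
2015-10-21,0.02,0.01,0.12
"""


def get_treasury_data(source):
    """Return a DataFrame with a UTC DatetimeIndex and one column per duration."""
    frame = pd.read_csv(
        source,
        index_col="Time Period",
        parse_dates=["Time Period"],
        na_values=["ND"],  # assumed "no data" marker
    )
    # Assumed: source values are percentages; convert to decimal rates.
    return (frame / 100.0).tz_localize("UTC")


curves = get_treasury_data(io.StringIO(SAMPLE_CSV))
print(curves)
```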

Detailed comparison between results from new and old loader::

    import pandas as pd
    from zipline.data.treasuries import get_treasury_data

    new = get_treasury_data()  # New implementation
    old = pd.read_csv(  # Previously cached data
        '/home/ssanderson/.zipline/data/treasury_curves.csv',
        parse_dates=[0],
        index_col=0,
    )
    # These columns were unused.
    del old['tid']; del old['date']
    old = old.tz_localize('UTC')
    # old data erroneously contained an all-NaN entry for Columbus Day
    # in 2010.  Remove before comparing.
    old = old.dropna(how='all')

    In [25]: len(new) == len(old)
    Out[25]: True

    In [26]: abs(old - new).max()
    Out[26]:
    10year    2.000000e-04
    1month    6.938894e-18
    1year     1.000000e-04
    20year    1.000000e-04
    2year     2.000000e-04
    30year    1.000000e-04
    3month    1.000000e-03
    3year     1.000000e-04
    5year     1.387779e-17
    6month    1.000000e-04
    7year     1.000000e-04
    dtype: float64

    In [27]: abs(old - new).mean()
    Out[27]:
    10year    3.097414e-08
    1month    4.396534e-19
    1year     1.548707e-08
    20year    3.624502e-08
    2year     4.646120e-08
    30year    1.830496e-08
    3month    1.549427e-07
    3year     1.548707e-08
    5year     1.702619e-18
    6month    1.548707e-08
    7year     1.548707e-08
    dtype: float64

Since www.treasury.gov only reports values up to three significant
digits, we should only care about differences of 1e-3 or more.

There is exactly one such difference: the entry for the three-month bond
on 1999-10-01::

    In [60]: new[(abs(new - old) >= 1e-3).any(axis=1)].T
    Out[60]:
    Time Period  1999-10-01 00:00:00+00:00
    1month                             NaN
    3month                          0.0498
    6month                          0.0501
    1year                           0.0530
    2year                           0.0573
    3year                           0.0583
    5year                           0.0590
    7year                           0.0622
    10year                          0.0600
    20year                          0.0657
    30year                          0.0615

    In [61]: old[(abs(new - old) >= 1e-3).any(axis=1)].T
    Out[61]:
                 1999-10-01 00:00:00+00:00
    10year                          0.0600
    1month                             NaN
    1year                           0.0530
    20year                          0.0657
    2year                           0.0573
    30year                          0.0615
    3month                          0.0488
    3year                           0.0583
    5year                           0.0590
    6month                          0.0501
    7year                           0.0622

The US Treasury website (our old source) provides a value of 0.0488 here,
whereas the Federal Reserve site (our new source) provides a value of
0.0498.
MAINT: Final polish on loader rewrites.
- Fixes an issue with the Canadian treasury loader where the cached data
  never looked complete enough to skip the redownload, because the loader
  can only download data from the last 10 years.
- Uses module objects directly instead of lazy imports.
- Adds lots of docstrings.
ENH: Always use Adjusted Close for benchmarks.
Previously we were using Close, and we calculated returns on the first
day of a window against the Open for that day.  We now always look back
an extra day to get the previous day's close.
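
A small sketch of the look-back idea with made-up prices (the series below is illustrative, not real benchmark data): by including one extra day before the window, the first day's return is computed close-to-close like every other day.

```python
import pandas as pd

# Made-up adjusted closes; 2015-10-16 is the extra look-back day.
adj_close = pd.Series(
    [100.0, 102.0, 101.0, 103.02],
    index=pd.to_datetime(
        ["2015-10-16", "2015-10-19", "2015-10-20", "2015-10-21"]
    ),
)
window_start = pd.Timestamp("2015-10-19")

# pct_change leaves NaN on the first row, which is the look-back day,
# so slicing from the window start drops it.
returns = adj_close.pct_change().loc[window_start:]
print(returns)
```
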
BUG: Better check for last date.
Use get_loc to find the trading day that ended 2 days before now.
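
One possible reading of this check, sketched with a made-up calendar: find the position of the last trading day at or before "now minus two days". The helper name and the business-day calendar are hypothetical. This sketch uses `get_indexer` with `method="ffill"` rather than `get_loc`, since recent pandas releases no longer accept a `method` argument on `get_loc`.

```python
import pandas as pd


def last_complete_day_loc(trading_days, now, lag=pd.Timedelta(days=2)):
    """Position of the last trading day at or before (now - lag).

    Returns -1 if no trading day is that early. `trading_days` must be
    a sorted DatetimeIndex in the same timezone as `now`.
    """
    target = (now - lag).normalize()
    return trading_days.get_indexer([target], method="ffill")[0]


# Stand-in calendar: weekdays only, UTC.
trading_days = pd.bdate_range("2015-10-12", "2015-10-30", tz="UTC")
now = pd.Timestamp("2015-10-26 15:00", tz="UTC")  # a Monday

loc = last_complete_day_loc(trading_days, now)
print(loc, trading_days[loc])
```

Two days before the Monday is a Saturday, so the forward fill resolves to the preceding Friday.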

@ssanderson ssanderson force-pushed the treasury-cleanup branch from 0c3a224 to 0188891 Oct 25, 2015

ssanderson added a commit that referenced this pull request Oct 25, 2015

@ssanderson ssanderson merged commit e49cc3a into master Oct 25, 2015

2 checks passed

Scrutinizer: 3 updated code elements
continuous-integration/travis-ci/pr: The Travis CI build passed

@ssanderson ssanderson deleted the treasury-cleanup branch Oct 25, 2015
