New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: create adjustments based on ordered assets instead of set #1547

Merged
merged 3 commits into from Oct 24, 2016

Conversation

Projects
None yet
3 participants
@mtydykov
Contributor

mtydykov commented Oct 19, 2016

This fixes a bug where we create adjustments for estimates based on where assets happen to fall in a set rather than using ordered assets. Not sure if this necessitates a test or what kind.

@mtydykov mtydykov force-pushed the fix-asset-selection-bug branch from e322f7f to e3243d6 Oct 19, 2016

@@ -424,6 +424,7 @@ def load_adjusted_array(self, columns, dates, assets, mask):
out = {}
# To optimize performance, only work below on assets that are
# actually in the raw data.
import pdb; pdb.set_trace()

This comment has been minimized.

@llllllllll
@llllllllll

This comment has been minimized.

Member

llllllllll commented Oct 19, 2016

Why does the order matter here. I think a test would help me understand how this change affects the results, and help us ensure that we don't break this later.

@mtydykov

This comment has been minimized.

Contributor

mtydykov commented Oct 19, 2016

Order matters because when we create adjustments, we do the following to figure out at what column the adjustment should be applied:

sid_to_idx = dict(zip(assets, range(len(assets))))

Because we were passing in a set of assets that are used, if the resulting map happened to put the sids out of order, adjustments got misaligned with the base data later.

So actually, I don't have it right in the fix because I'm passing an ordered subset of the assets (just the assets we use). The indexes we use to create adjustments need to line up with the entire set of assets. I just updated the code with the correct fix.

I agree, I just can't think of any kind of test for this other than figuring out in what way we can create a set out of a list of sids so that they end up out of order.

@coveralls

This comment has been minimized.

coveralls commented Oct 19, 2016

Coverage Status

Coverage remained the same at 86.893% when pulling a3fe67d on fix-asset-selection-bug into 7c72eef on master.

@mtydykov

This comment has been minimized.

Contributor

mtydykov commented Oct 20, 2016

I updated the existing adjustments test to add gaps between sids, which addresses 1/2 issues. It addresses the issue where we were only using a subset of the assets (the 'used' assets) to compute column indexes for applying adjustments. The test now fails when the bug is present.

However, I still don't know how to test order in any meaningful way. The order issue came in when a qexec test I had written was passing sids [2, 24], which happened to be ordered as {24:0, 2:1} after applying set and then dict(zip(assets, range(len(assets)))). The way the sids happen to be ordered with the bug in place depends on set implementation details. Not sure we want to tie our test data to that.

@coveralls

This comment has been minimized.

coveralls commented Oct 20, 2016

Coverage Status

Coverage remained the same at 86.893% when pulling f177c36 on fix-asset-selection-bug into 7c72eef on master.

sid_10_timeline,
sid_20_timeline]).reset_index()
return concatted.reindex(np.random.permutation(concatted.index))

This comment has been minimized.

@llllllllll

llllllllll Oct 20, 2016

Member

we shouldn't use unseeded random in a test

This comment has been minimized.

@mtydykov

mtydykov Oct 20, 2016

Contributor

Added a seed

df = pd.concat([df, pd.DataFrame(columns=new_sids)])
# Have to explicitly cast to float so that nans are of type
# np.float64 - otherwise they won't compare equal.
for col in new_sids:

This comment has been minimized.

@llllllllll

llllllllll Oct 20, 2016

Member

can you explain why we are doing this?

This comment has been minimized.

@mtydykov

mtydykov Oct 20, 2016

Contributor

Why we are explicitly setting values to np.float64? That's in the comment above. Or something else?

This comment has been minimized.

@llllllllll

llllllllll Oct 20, 2016

Member

I understand the mechanics of what you are doing, I don't understand why this is needed at all.

This comment has been minimized.

@mtydykov

mtydykov Oct 20, 2016

Contributor

Still unsure of what you mean. Do you mean why we need sids in assets that aren't used in the raw data?

This comment has been minimized.

@llllllllll

llllllllll Oct 20, 2016

Member

I think all of this can be replaced with df = df.reindex(columns=df.columns.union(new_sids))

This comment has been minimized.

@mtydykov

mtydykov Oct 20, 2016

Contributor

I had tried doing that before, but it doesn't work for the reason I explained in the comment. Here's what happens:

(Pdb++) type(estimate[0][0])
<type 'numpy.float64'>
(Pdb++) type(today_timeline[timeline_start_idx:][0][0])
<type 'float'>
(Pdb++) assert_equal(estimate,today_timeline[timeline_start_idx:])
*** AssertionError:
Arrays are not equal

(mismatch 100.0%)
 x: array([[ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
         nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,...
 y: array([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,...

This comment has been minimized.

@llllllllll

llllllllll Oct 21, 2016

Member

I don't understand why the types are off:

In [1]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype='f8')

In [2]: df.reindex(columns=['a', 'b', 'c'])
Out[2]: 
     a    b   c
0  1.0  4.0 NaN
1  2.0  5.0 NaN
2  3.0  6.0 NaN

In [3]: _.dtypes
Out[3]: 
a    float64
b    float64
c    float64
dtype: object

also, in the astype below, use float64 or float32 to be explicit.

This comment has been minimized.

@mtydykov

mtydykov Oct 21, 2016

Contributor

Updated the cast

@coveralls

This comment has been minimized.

coveralls commented Oct 20, 2016

Coverage Status

Coverage remained the same at 86.893% when pulling 2bae129 on fix-asset-selection-bug into 7c72eef on master.

@coveralls

This comment has been minimized.

coveralls commented Oct 21, 2016

Coverage Status

Coverage increased (+0.07%) to 86.96% when pulling e6cf6fc on fix-asset-selection-bug into 7c72eef on master.

@coveralls

This comment has been minimized.

coveralls commented Oct 21, 2016

Coverage Status

Coverage increased (+0.07%) to 86.96% when pulling f9c8d9d on fix-asset-selection-bug into 7c72eef on master.

TST: update adjustment tests - add gaps between sids
TST: add a seed for permuting

@mtydykov mtydykov force-pushed the fix-asset-selection-bug branch from f9c8d9d to e1f008e Oct 21, 2016

@coveralls

This comment has been minimized.

coveralls commented Oct 21, 2016

Coverage Status

Coverage remained the same at 86.96% when pulling e1f008e on fix-asset-selection-bug into e68b772 on master.

@mtydykov mtydykov merged commit 54ebd9e into master Oct 24, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@mtydykov mtydykov deleted the fix-asset-selection-bug branch Oct 24, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment