Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort pipeline data on asofdate #1710

Merged
merged 1 commit into from Mar 15, 2017
Merged

Sort pipeline data on asofdate #1710

merged 1 commit into from Mar 15, 2017

Conversation

@mtydykov
Copy link
Contributor

@mtydykov mtydykov commented Mar 14, 2017

Addresses an issue where values for 2 subsequent dates, d1 and d2, came in for the first time on d3 and had the same timestamp; because d1 happened to be in a later row than d2, last_in_date_group selected d1 to be the "latest" value for the next date in the index, which was d5. The fix for this issue is to sort the data on asof_date before we drop the asof_date column.

# If we ever have cases where we find out about multiple asof_dates'
# data on the same TS, we want to make sure that last_in_date_group
# selects the correct last asof_date's value.
sparse_output.sort_values(AD_FIELD_NAME, inplace=True)

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

should last_in_date_group just sort the input if it is not already sorted?

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

We could add some conditional logic to sort the input on some specified fields.

My thought here was that last_in_date_group should just expect data that's already sorted so that all it needs to do is to group and select that last item in each group, as the name of the method suggests.

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

Idk, it seems like we could have this bug at all of the other call sites for this function which makes me want to have the function check itself.

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

Other call sites may need to use different fields to sort the data, and so they would have to tell the function how to check itself anyway.

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

Then again, maybe having that extra, optional argument on that method would make people think about the sorting aspect when calling it. Either that, or we just add clarify in the docstring that the input df must be sorted so that the correct 'last' items are chosen. I think at this point I'm 50-50 on either option.

fields['other'] = fields['value']
expected = pd.DataFrame(
data=[[np.nan, np.nan], # 2014-01-02
[np.nan, np.nan], # 2014-01-03

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

the indentation here looks like it got messed up

"""
T = pd.Timestamp
df = pd.DataFrame(
columns=['asof_date', 'timestamp', 'other', 'value'],

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

what's with the whitespace here?


def test_id_take_last_in_group_sorted(self):
"""
output (expected):

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

can you also show the input?

@@ -1876,6 +1876,54 @@ def test_checkpoints_out_of_bounds(self):
self._test_checkpoints(checkpoints)


class LastInGroupSortingTestCase(BlazeToPipelineTestCase):

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

why is this in its own test suite?

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

Because I wanted a different set of dates to work with

This comment has been minimized.

@llllllllll

llllllllll Mar 14, 2017
Contributor

then you can pass your own dates around in the function instead of using self.dates.

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

Yeah, but then other helper functions like_test_id_macro use self.dates. I was trying not to change the testing structure around too much.

This comment has been minimized.

@mtydykov

mtydykov Mar 14, 2017
Author Contributor

Though I forgot that subclassing won't work here because it's trying to run all the other tests under this class.

@llllllllll
Copy link
Contributor

@llllllllll llllllllll commented Mar 14, 2017

looks like the blaze pipeline tests are now failing

@mtydykov
Copy link
Contributor Author

@mtydykov mtydykov commented Mar 14, 2017

Yeah, looks like they only fail with the addition of the new test - the fix itself doesn't seem to make them fail.

@coveralls
Copy link

@coveralls coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+0.002%) to 87.475% when pulling 7b1c2d5 on sort-pipeline-data-on-asofdate into 40aec72 on master.

@coveralls
Copy link

@coveralls coveralls commented Mar 14, 2017

Coverage Status

Coverage increased (+0.003%) to 87.477% when pulling 7b1c2d5 on sort-pipeline-data-on-asofdate into 40aec72 on master.

"""
input
asof_date timestamp other value
2014-01-02 2014-01-04 00 3 3

This comment has been minimized.

@llllllllll

llllllllll Mar 15, 2017
Contributor

this isn't the input, the point is that these are flipped.

This comment has been minimized.

@mtydykov

mtydykov Mar 15, 2017
Author Contributor

Yeah, good catch, flipped the order by accident.

@@ -316,6 +318,8 @@ def last_in_date_group(df,
)]]
if have_sids:
idx += [SID_FIELD_NAME]
if not extra_groupers:

This comment has been minimized.

@llllllllll

llllllllll Mar 15, 2017
Contributor

we probably want is None to be more clear and allow for empty sequences

Parameters
----------
df : pd.DataFrame
The DataFrame containing the data to be grouped.
The DataFrame containing the data to be grouped. Must be sorted so that

This comment has been minimized.

@llllllllll

llllllllll Mar 15, 2017
Contributor

+1

@@ -281,15 +281,17 @@ def last_in_date_group(df,
assets,
reindex=True,
have_sids=True,
extra_groupers=[]):
extra_groupers=None):

This comment has been minimized.

@llllllllll

llllllllll Mar 15, 2017
Contributor

alternativly, we could make this a tuple

@llllllllll
Copy link
Contributor

@llllllllll llllllllll commented Mar 15, 2017

lgtm, can you just squash this into 1 commit?

MAINT: fix arg default and update docstring
@mtydykov mtydykov force-pushed the sort-pipeline-data-on-asofdate branch from 695be96 to 6e4060f Mar 15, 2017
@mtydykov
Copy link
Contributor Author

@mtydykov mtydykov commented Mar 15, 2017

👍

@coveralls
Copy link

@coveralls coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+0.003%) to 87.477% when pulling 6e4060f on sort-pipeline-data-on-asofdate into 36e65d6 on master.

@coveralls
Copy link

@coveralls coveralls commented Mar 15, 2017

Coverage Status

Coverage increased (+0.003%) to 87.477% when pulling 6e4060f on sort-pipeline-data-on-asofdate into 36e65d6 on master.

@mtydykov mtydykov merged commit f445517 into master Mar 15, 2017
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mtydykov mtydykov deleted the sort-pipeline-data-on-asofdate branch Mar 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.