Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃悰 Expression.minmax() not returning proper value for datetimes #1469

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kmcentush
Copy link
Contributor

@kmcentush kmcentush commented Jul 17, 2021

Fixes #1456

I left some comments in the #development Slack channel, as it looks like min(), max() and minmax() were using different approaches. I'm not sure which is more efficient/robust long-term, but I opted for the simpler solution.

Note: In this solution, StatOpMinMax is no longer used anywhere in the code base.

@kmcentush kmcentush changed the title 馃悰 xpression.minmax() not 馃悰 Expression.minmax() not returning proper value for datetimes Jul 17, 2021
@@ -509,3 +509,11 @@ def test_agg_count_with_custom_name():
df_grouped = df.groupby(df.x, sort=True).agg({'mycounts': vaex.agg.count(), 'mycounts2': 'count'})
assert df_grouped['mycounts'].tolist() == [3, 2, 1]
assert df_grouped['mycounts2'].tolist() == [3, 2, 1]


def test_minmax_ts(df):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that in addition to this, the test asserts for the min and max against the actual min and max values. It could happen that minmax is consistent with min and min but still objectively wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Not sure why I didn't have that check in the first commit.

@kmcentush
Copy link
Contributor Author

Somehow missed all of those tests failing. Will check again on this PR later tonight.

@maartenbreddels
Copy link
Member

yeah, seems odd, you could try a rebase.. although I don't remember seeing these issues before.

@kmcentush
Copy link
Contributor Author

The latest commit should fix the issue. I was returning a list instead of the np array like before. Also, my previous implementation didn't respect delayed execution, so I changed the code to use the previous structure and just swapped out the self.limits() call for min() and max(), respectively.

This may suggest that the root case bug in limits() (or more likely, StatOpMinMax?) is still present for my added test case. But as stated above in the original PR comment, this now makes StatOpMinMax dead code.

@kmcentush
Copy link
Contributor Author

Not sure why it's failing on non-Windows builds... The errors in the Linux and Mac builds are notebook timeout errors.

@JovanVeljanoski
Copy link
Member

Ah don't worry about it. Those errors are unrelated. I hope we can clean them up soon.
I don't think that the notebooks run on the windows CI which is why that part does not fail :).

Looks good to me! But let's wait for @maartenbreddels to also give his blessing :)

Thanks!

@maartenbreddels
Copy link
Member

Thanks for this PR!
However, this reminds me that we actually need to have a minmax aggregator, now we lose a bit of performance by doing a min and a max separately.
@JovanVeljanoski what do you think?

@JovanVeljanoski
Copy link
Member

Ups I completely forgot about this one :(

Well, indeed it would be great to do min and max in one pass over the data. If one is interested only in one of them, is there any performance benefit to doing only min for example? @maartenbreddels

@maartenbreddels
Copy link
Member

Yes, we have both min, max and minmax for that reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG-REPORT] Using a factor for grouping by unit of time results in empty DataFrame
3 participants