New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix winsorize bug #2138

Merged
merged 2 commits into from Apr 5, 2018

Conversation

Projects
None yet
3 participants
@ssanderson
Member

ssanderson commented Apr 4, 2018

Fixes a bug where Factor.winsorize counted NaN values as being larger
than all other data points when calculating percentiles.

In practice, this means that most calls to winsorize did not clip
upper-percentile values and were more lenient than expected with
lower-percentile values.

@ssanderson ssanderson requested a review from ahgnaw Apr 4, 2018

@ssanderson ssanderson force-pushed the fix-winsorize-bug branch from 4a1a02f to 50d6fe4 Apr 4, 2018

@ahgnaw

ahgnaw approved these changes Apr 4, 2018

A small typo, otherwise LGTM

nan_count = isnan(row).sum()
nonnan_count = a.size - nan_count
# NOTE: argsort() sorts nans to the end of the array.

This comment has been minimized.

@ahgnaw

ahgnaw Apr 4, 2018

Contributor

👍

the value at the minimum percentile. Similarly, values ranking above
the maximum percentile are changed to the value at the maximum
percentile. Winsorizing is often useful for limiting the impact of
extreme data pointswithout completely removing those points.

This comment has been minimized.

@ahgnaw

ahgnaw Apr 4, 2018

Contributor

space needed pointswithout -> points without

If ``groupby`` is supplied, compute by partitioning each row based on
the values produced by ``groupby``, winsorizing the partitioned arrays,
and stitching the sub-results back together.
If ``groupby`` is supplied, winsorization is applied separately

This comment has been minimized.

@ahgnaw

ahgnaw Apr 4, 2018

Contributor

Much better explanation here.

DOC: Update docstring for Factor.winsorize.
Fixed a few typos, and changed some wording slightly.

@ssanderson ssanderson force-pushed the fix-winsorize-bug branch 2 times, most recently from 6c88762 to da347bb Apr 5, 2018

@coveralls

This comment has been minimized.

coveralls commented Apr 5, 2018

Coverage Status

Coverage increased (+0.06%) to 87.111% when pulling da347bb on fix-winsorize-bug into 52b3329 on master.

@coveralls

This comment has been minimized.

coveralls commented Apr 5, 2018

Coverage Status

Coverage increased (+0.06%) to 87.111% when pulling 62f5715 on fix-winsorize-bug into 52b3329 on master.

BUG: Fix bug in Factor.winsorize with nan inputs.
Fixes a bug where `Factor.winsorize` counted NaN values as being larger
than all other data points when calculating percentiles.

In practice, this means that most calls to `winsorize` did not clip
upper-percentile values and were more lenient than expected with
lower-percentile values.

@ssanderson ssanderson force-pushed the fix-winsorize-bug branch from da347bb to 62f5715 Apr 5, 2018

@ssanderson ssanderson merged commit 42a857d into master Apr 5, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the fix-winsorize-bug branch Apr 5, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment