-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: Require sample weights to sum to less than 1 when replace = True #61582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: Require sample weights to sum to less than 1 when replace = True #61582
Conversation
1281002
to
1d1f742
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, but the condition needs adjusted. Will also need a note in the whatsnew for 3.0 under the Other section.
pandas/core/sample.py
Outdated
if weight_sum > 1 and replace: | ||
raise ValueError( | ||
"Invalid weights: If `replace`=True weights must sum to less than 1" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct condition. See the formula in the OP of the linked issue.
pandas/core/generic.py
Outdated
When replace = True will not allow weights that add up to less | ||
than 1, to avoid biased results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When replace = True will not allow weights that add up to less | |
than 1, to avoid biased results. | |
When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, | |
to avoid biased results. |
2bc3618
to
e5f9a07
Compare
688cabe
to
2aacee7
Compare
4dbeaf3
to
1dfc85d
Compare
9f62a65
to
11df613
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
if weights is not None: | ||
is_max_weight_dominating = size * weights.max() > 1 | ||
if is_max_weight_dominating and not replace: | ||
raise ValueError( | ||
"Invalid weights: If `replace`=False, " | ||
"total unit probabilities have to be less than 1" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Avoid O(n) computation if it is not needed.
- Combine with the
if
block above; no need to checkif weights is not None
twice. - Suggestion on a more clear error message.
if weights is not None: | |
is_max_weight_dominating = size * weights.max() > 1 | |
if is_max_weight_dominating and not replace: | |
raise ValueError( | |
"Invalid weights: If `replace`=False, " | |
"total unit probabilities have to be less than 1" | |
) | |
if not replace and size * weights.max() > 1: | |
raise ValueError( | |
"Weighted sampling cannot be achieved with replace=False. Either " | |
"set replace=True or use smaller weights. See the docstring of sample " | |
"for details." | |
) |
@@ -5815,6 +5815,8 @@ def sample( | |||
If weights do not sum to 1, they will be normalized to sum to 1. | |||
Missing values in the weights column will be treated as zero. | |||
Infinite values not allowed. | |||
When replace = False will not allow ``(n * max(weights) / sum(weights)) > 1``, | |||
in order to avoid biased results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add "See the Notes below for more details." and then add a bit more detail in the notes section. I can give it a shot if you'd prefer.
# edge case, n*max(weights)/sum(weights) == 1 | ||
edge_variance_weights = [1] * 10 | ||
edge_variance_weights[0] = 9 | ||
# should not raise | ||
obj.sample(n=2, weights=edge_variance_weights, replace=False) | ||
|
||
low_variance_weights = [1] * 10 | ||
low_variance_weights[0] = 8 | ||
# should not raise | ||
obj.sample(n=2, weights=low_variance_weights, replace=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make these two separate tests.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.