Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Pymod operation is missing null values #5938

Closed
galipremsagar opened this issue Aug 11, 2020 · 8 comments · Fixed by #12074
Closed

[BUG] Pymod operation is missing null values #5938

galipremsagar opened this issue Aug 11, 2020 · 8 comments · Fixed by #12074
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@galipremsagar
Copy link
Contributor

Describe the bug
When there is 0 in denominator of pymod operation, we seem to be returning garbage value instead of null value.

Steps/Code to reproduce bug

Type "help", "copyright", "credits" or "license" for more information.
>>> 1 % 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
>>> import pandas as pd
>>> x = pd.Series([1, 2, 3, 4, 0])
>>> x  % 2
0    1
1    0
2    1
3    0
4    0
dtype: int64
>>> 2 % x
0    0.0
1    0.0
2    2.0
3    2.0
4    NaN
dtype: float64

>>> import cudf
>>> x = cudf.from_pandas(x)
>>> x % 2
0    1
1    0
2    1
3    0
4    0
dtype: int64
>>> 2 % x
0             0
1             0
2             2
3             2
4    4294967295   #<- This could either be `nan` or `null`
dtype: int64

Expected behavior
Return either nan or null inplace of garbage values

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source (0.15)

Additional context
This kind of looks similar to #5722 but not sure if they share the same code-flows, so Just putting it out there.

@galipremsagar galipremsagar added bug Something isn't working Needs Triage Need team to review and classify labels Aug 11, 2020
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Aug 12, 2020
@kkraus14
Copy link
Collaborator

I guess it makes sense for a modulo zero operation to return null since it's undefined behavior but this is somewhere we would differ from Pandas.

@rgsl888prabhu rgsl888prabhu self-assigned this Aug 14, 2020
@github-actions
Copy link

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@kkraus14
Copy link
Collaborator

cc @brandon-b-miller for more binaryop null things

@vyasr
Copy link
Contributor

vyasr commented Aug 4, 2022

Binary operations in libcudf are designed so that nulls always propagate. As a result, although we have created a specialized mod operator PYMOD in libcudf for compatibility with Python's rules for handling negatives, we can't trivially exploit that for null handling since the operator doesn't return a nullable value.

However, in this instance there is some precedent for solving this at the libcudf level. Specifically, all the Null* operators have special handling to bypass null propagation in the wrapper for ops. While I would normally say that we should solve this issue by post-processing the result in Python (similar to how #11441 solves #7389), the existing infrastructure for solving this problem makes me less wary of introducing this logic into C++.

@jrhemstad I'd be curious what you think about this. The tl;dr is that in pandas x % 0 returns nulls, and we're trying to figure out how best to do that. Options are (1) using the same wrapping logic that we currently use in libcudf for the Null* operators, which I find slightly icky (but only slightly since we're already doing this, it's just a little less obvious why you'd do this for an operator that isn't strictly about specialized null handling) or (2) just post-processing the result in Python based on whether the input values contained zeros.

CC @brandon-b-miller

@jrhemstad
Copy link
Contributor

jrhemstad commented Aug 4, 2022

although we have created a specialized mod operator PYMOD in libcudf for compatibility with Python's rules for handling negatives

"PYMOD" is a misnomer. It wasn't intended to be 100% compliant with Python's %. It was intended to just be the mathematical modulo operation vs C++'s % is the remainder operation.

See: https://stackoverflow.com/questions/13683563/whats-the-difference-between-mod-and-remainder

I wanted to rename MOD to REMAINDER and PYMOD would be MODULO.

While I would normally say that we should solve this issue by post-processing the result in Python

This is the correct answer. It's no different than #7389 (comment) imo.

@jrhemstad
Copy link
Contributor

jrhemstad commented Aug 4, 2022

Found the original conversation: #1985 (review)

What's worse is calling an operation PyMod. It doesn't tell a user of the C++ interface what it actually means. In general, we should use descriptions that stand on their own and do not resort to knowledge of another language or library.

I still agree with my past self :)

@vyasr
Copy link
Contributor

vyasr commented Aug 4, 2022

I'm fine with that explanation, although to @harrism's point we'd definitely need to document how we define remainder vs modulo since it's at best it's a nonstandard but common distinction. So perhaps we should do that rename while also addressing this issue here :)

I think resolving this issue then requires two things:

  1. Rename PYMOD to REMAINDER (along with suitable documentation of the meaning)
  2. Add the same kind of logic as Return zero when floor dividing an integer data by zero #11441 except for modulo.

@brandon-b-miller would be up for handling this? I'm hoping it would be a near carbon copy of #11441 except with a different fill value. I can take on 1 myself if you don't want to.

@brandon-b-miller
Copy link
Contributor

happy to handle making the change 👍

wence- added a commit to wence-/cudf that referenced this issue Nov 8, 2022
wence- added a commit to wence-/cudf that referenced this issue Nov 11, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 16, 2022
The type normalisation applied before heading into libcudf previously had slightly unexpected consequences for large int64 values. If not providing a `cudf.Scalar`, a bare `int64` scalar would be cast to `uint64` and then normal numpy type promotion would unify to `float64`. This is lossy, since int64 to float64 is neither surjective nor injective.

To avoid this, try very hard to maintain the dtype of the object coming in, and match pandas behaviour by applying numpy type promotion rules via `numpy.result_type`. 

- Closes #5938
- Closes #7389
- Closes #12072
- Closes #12092

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12074
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants