Preserve NAs when casting R data.frames to pandas. #1439
This PR will also cast `NA` to `None`. With CRAN reticulate:

```r
library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0     NA
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> 'NA'
```

With this PR:

```r
library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0   None
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> None
```
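The practical difference is easy to see in plain pandas (an illustrative check, independent of reticulate): the string `'NA'` is indistinguishable from real data, while `None` is recognized as missing.

```python
import pandas as pd

# Column as produced by CRAN reticulate: NA arrives as the literal string 'NA'.
before = pd.DataFrame({"string": ["NA", "a", "b", "c", "d"]})
# Column as produced by this PR: NA arrives as Python None.
after = pd.DataFrame({"string": [None, "a", "b", "c", "d"]})

print(before["string"].isna().sum())  # 0 -- 'NA' looks like a real value
print(after["string"].isna().sum())   # 1 -- None is detected as missing
```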
Great work @dfalbel!
Can you please add a NEWS entry?
Seeing this makes me wonder if we should similarly take this opportunity to change `NA_character_` to convert to Python `None` outside of a data.frame as well, e.g. `r_to_py(c(NA, letters[1:3]))` (or when pandas is not installed).
Yeah, I think we can do this in a future PR. We have to think about whether we want to be consistent with other data types, e.g., what should we do with `NA`s in numerics, integers, and logicals outside a pandas data frame? And what's the desired behavior when converting back to R: should we simplify the vectors?
Related #197 and #207
This PR adds an optional behavior for handling R data.frame atomic columns when converting to pandas data frames. The main motivation is to better handle R data frames that contain columns with `NA`s. If the `reticulate.pandas_use_nullable_dtypes` option is set, instead of converting to raw numpy data types we use the (newish) pandas nullable data types, which have built-in support for missing values. With the new behavior, `NA`s survive the conversion to pandas, whereas the old behavior coerced them to numpy representations. Preserving `NA`s helps make round-trip casts more reliable.
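For context, pandas' nullable extension dtypes preserve missing values where the plain numpy-backed dtypes coerce them. A minimal sketch in plain pandas (illustrative, not the PR's code):

```python
import pandas as pd

# Plain numpy-backed dtypes: an integer column containing a missing value
# is silently promoted to float64, with the NA stored as NaN.
raw = pd.DataFrame({"x": [1, None, 3]})
print(raw["x"].dtype)  # float64 -- the missing value forced a float promotion

# Nullable extension dtypes keep the logical integer type and represent
# the missing entry as pd.NA instead.
nullable = pd.DataFrame({"x": pd.array([1, None, 3], dtype="Int64")})
print(nullable["x"].dtype)        # Int64
print(nullable["x"][1] is pd.NA)  # True
```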
Benchmarks show the conversion is roughly twice as slow for a 100k-row data frame, which is somewhat expected, as we need to build a boolean mask for each column.
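The per-column mask can be sketched like this (a hypothetical helper for illustration, not the PR's actual implementation): for each column we compute which entries are missing, then construct the nullable array from the values plus that mask, which is the extra O(n) pass the benchmark reflects.

```python
import numpy as np
import pandas as pd

def to_nullable_int(values):
    """Hypothetical sketch: build a pandas nullable integer array from a
    float vector that uses NaN for missing entries (as R NAs arrive)."""
    arr = np.asarray(values, dtype="float64")
    mask = np.isnan(arr)                           # O(n) boolean mask per column
    data = np.where(mask, 0, arr).astype("int64")  # placeholder values under the mask
    return pd.arrays.IntegerArray(data, mask)

col = to_nullable_int([1.0, float("nan"), 3.0])
print(col.dtype)        # Int64
print(col[1] is pd.NA)  # True
```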