Preserve NAs when casting R data.frames to pandas. #1439

Merged: 9 commits into main from pandas-na on Aug 15, 2023

dfalbel (Member) commented Aug 10, 2023

Related to #197 and #207.

This PR adds an optional behavior for handling the atomic columns of R data.frames when converting them into pandas data frames. The main motivation is to better handle R data frames that contain columns with NAs.

If the reticulate.pandas_use_nullable_dtypes option is set to TRUE, then instead of converting into raw numpy data types, we use the (newish) pandas nullable data types, which have built-in support for NAs.

With the new behavior, doing:

df <- data.frame(
  int = c(NA, 1:4),
  num = c(NA, rnorm(4)),
  lgl = c(NA, rep(c(TRUE, FALSE), 2)),
  string = c(NA, letters[1:4])
)
options(reticulate.pandas_use_nullable_dtypes = TRUE)
r_to_py(df)

Results in:

    int       num    lgl string
0  <NA>      <NA>   <NA>   <NA>
1     1  0.491903   True      a
2     2   1.54789  False      b
3     3  0.470248   True      c
4     4  0.125063  False      d

While with the old behavior we get:

          int       num    lgl string
0 -2147483648       NaN   True   None
1           1  0.282169   True      a
2           2 -0.001555  False      b
3           3 -0.190735   True      c
4           4  0.678618  False      d

Preserving NAs helps make round-trip casts more reliable.
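
For readers wondering where the `-2147483648` and the spurious `True` in the old output come from: R stores `NA_integer_` and logical `NA` as the sentinel `INT_MIN`, which numpy treats as ordinary data. The following is only a rough Python sketch of the mask-based idea, assuming the column arrives as a raw int32 buffer. The helper name `to_nullable_int` is hypothetical and this is not reticulate's actual implementation.

```python
import numpy as np
import pandas as pd

# R's NA_integer_ (and logical NA) is stored as INT_MIN, i.e. -2147483648.
R_NA_INTEGER = np.iinfo(np.int32).min


def to_nullable_int(values):
    """Hypothetical helper: wrap a raw R integer buffer as a pandas nullable array."""
    values = np.asarray(values, dtype=np.int32)
    mask = values == R_NA_INTEGER  # True wherever R had NA
    return pd.arrays.IntegerArray(values.copy(), mask)


raw = np.array([R_NA_INTEGER, 1, 2, 3, 4], dtype=np.int32)

# Nullable conversion: the sentinel is hidden behind the mask and shows as <NA>.
col = pd.Series(to_nullable_int(raw))

# Plain numpy view of a logical column: the sentinel is nonzero, so it reads as True.
as_bool = raw.astype(bool)
```

The mask costs one extra pass over each column, which is in line with the slowdown reported in the benchmark.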

A benchmark shows the new behavior is about twice as slow for a 100k-row data frame (median 6.97ms vs 3.49ms), which is somewhat expected, as we need to build a boolean mask for each column.

# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time       gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>     <list>  
1 old           3.2ms   3.49ms      280.    54.8KB     2.95    95     1      339ms <NULL> <Rprofmem [22 × 3]> <bench_tm> <tibble>
2 new          6.72ms   6.97ms      143.    54.8KB     0       72     0      504ms <NULL> <Rprofmem [22 × 3]> <bench_tm> <tibble>

Benchmark code:
df <- data.frame(
  int = c(NA, 1:4),
  num = c(NA, rnorm(4)),
  lgl = c(NA, rep(c(TRUE, FALSE), 2)),
  string = c(NA, letters[1:4])
)


library(dplyr)
devtools::load_all()

set.seed(100)
x <- df %>% sample_n(100000, replace = TRUE)

bench::mark(check = FALSE,
  old = withr::with_options(c(reticulate.pandas_use_nullable_dtypes = FALSE), r_to_py(x)),
  new = withr::with_options(c(reticulate.pandas_use_nullable_dtypes = TRUE), r_to_py(x))
)

dfalbel (Member, Author) commented Aug 10, 2023

This PR also casts NA_character_ to None when converting R data.frames into pandas data frames with the numpy casts (that is, without the nullable dtypes option described above). The old behavior would convert it into the string 'NA', possibly causing data loss.

With CRAN reticulate:

library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0     NA
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> 'NA'

With this PR:

library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0   None
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> None
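
To illustrate why the string 'NA' is lossy on the Python side, here is a small pandas-only sketch. It assumes object-dtype columns, and the Series are built by hand rather than produced by reticulate.

```python
import pandas as pd

# Old behavior: NA_character_ arrives as the literal string 'NA'.
old = pd.Series(["NA", "a", "b"], dtype=object)

# New behavior: NA_character_ arrives as None.
new = pd.Series([None, "a", "b"], dtype=object)

# pandas cannot tell the string 'NA' apart from real data, but it does see None as missing.
old_missing = old.isna().tolist()
new_missing = new.isna().tolist()
```

With the string form, a genuine "NA" value (for example a country or chemical code) cannot be distinguished from a missing one after the conversion.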

t-kalinowski (Member) left a comment

Great work @dfalbel!

Can you please add a NEWS entry?

Seeing this makes me wonder if we should take this opportunity to also convert NA_character_ to Python None outside of a data.frame, e.g., r_to_py(c(NA, letters[1:3])) (or if pandas is not installed).

dfalbel (Member, Author) commented Aug 15, 2023

Yeah, I think we can do this in a future PR. We have to think about whether we want to be consistent with other data types, e.g., what should we do with NAs in numeric, integer, and logical vectors when they are outside a pandas data frame? And what's the desired behavior when converting back to R: should we simplify the vectors?

t-kalinowski merged commit dd2529b into main on Aug 15, 2023 (12 checks passed)
dfalbel deleted the pandas-na branch on August 15, 2023 17:27