Preserve NAs when casting R data.frames to pandas. #1439

dfalbel · 2023-08-10T13:51:51Z

Related to #197 and #207.

This PR adds optional handling for atomic columns of R data.frames when they are converted to pandas data frames. The main motivation is to better handle R data frames that contain columns with NAs.

If the option reticulate.pandas_use_nullable_dtypes is set, columns are converted to the (newish) pandas nullable data types, which have built-in support for NAs, instead of raw NumPy data types.
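For readers less familiar with the pandas side, the difference between the two representations can be sketched in plain Python. This is only an illustration with made-up values, not reticulate's conversion code; it assumes pandas and NumPy are installed, and relies on the fact that R stores NA_integer_ as the smallest 32-bit integer.

```python
import numpy as np
import pandas as pd

# A raw NumPy int32 column has no way to mark a value as missing, so R's
# NA_integer_ sentinel (-2147483648) shows up as an ordinary number.
raw = pd.Series(np.array([-2147483648, 1, 2], dtype="int32"))

# A pandas nullable integer column stores the missing value as pd.NA.
nullable = pd.Series(pd.array([None, 1, 2], dtype="Int32"))

print(raw.isna().tolist())       # [False, False, False]
print(nullable.isna().tolist())  # [True, False, False]
print(nullable.dtype)            # Int32
```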

With the new behavior, doing:

df <- data.frame(
  int = c(NA, 1:4),
  num = c(NA, rnorm(4)),
  lgl = c(NA, rep(c(TRUE, FALSE), 2)),
  string = c(NA, letters[1:4])
)
options(reticulate.pandas_use_nullable_dtypes = TRUE)
r_to_py(df)

Results in:

    int       num    lgl string
0  <NA>      <NA>   <NA>   <NA>
1     1  0.491903   True      a
2     2   1.54789  False      b
3     3  0.470248   True      c
4     4  0.125063  False      d

With the old behavior we get:

          int       num    lgl string
0 -2147483648       NaN   True   None
1           1  0.282169   True      a
2           2 -0.001555  False      b
3           3 -0.190735   True      c
4           4  0.678618  False      d

Preserving NAs helps make round-trip casts more reliable.

A benchmark shows the new conversion is about twice as slow for a 100k-row data frame (median 6.97ms vs 3.49ms), which is somewhat expected since we need to build a boolean mask for each column.

# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time       gc      
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>     <list>  
1 old           3.2ms   3.49ms      280.    54.8KB     2.95    95     1      339ms <NULL> <Rprofmem [22 × 3]> <bench_tm> <tibble>
2 new          6.72ms   6.97ms      143.    54.8KB     0       72     0      504ms <NULL> <Rprofmem [22 × 3]> <bench_tm> <tibble>
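The extra pass for the mask can be pictured with a small pandas/NumPy sketch. This is an assumption-laden illustration of the idea rather than the code in this PR: the helper name `to_nullable_int` is hypothetical, and the real conversion happens inside reticulate.

```python
import numpy as np
import pandas as pd

# R's NA_integer_ sentinel, i.e. the smallest 32-bit integer.
R_NA_INTEGER = np.iinfo(np.int32).min

def to_nullable_int(values: np.ndarray) -> pd.arrays.IntegerArray:
    """Hypothetical helper: wrap an int32 buffer in a masked pandas array."""
    # One additional scan of the column to locate the sentinel values.
    mask = values == R_NA_INTEGER
    return pd.arrays.IntegerArray(values, mask)

col = np.array([R_NA_INTEGER, 1, 2, 3, 4], dtype="int32")
arr = to_nullable_int(col)
print(pd.Series(arr))
```

Each column needs its own scan like the one above, which is consistent with the roughly 2x slowdown in the benchmark.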

Code

df <- data.frame(
  int = c(NA, 1:4),
  num = c(NA, rnorm(4)),
  lgl = c(NA, rep(c(TRUE, FALSE), 2)),
  string = c(NA, letters[1:4])
)


library(dplyr)
devtools::load_all()

set.seed(100)
x <- df %>% sample_n(100000, replace = TRUE)

bench::mark(check = FALSE,
  old = withr::with_options(c(reticulate.pandas_force_numpy = TRUE), r_to_py(x)),
  new = withr::with_options(c(reticulate.pandas_force_numpy = FALSE), r_to_py(x))
)

dfalbel · 2023-08-10T13:58:15Z

This PR will also cast NA_character_ to None when converting R data.frames to pandas data frames, when forcing the NumPy casts with the option specified above. The old behavior converted it to the string 'NA', possibly causing data loss.

With CRAN reticulate

library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0     NA
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> 'NA'

With this PR:

library(reticulate)
df <- data.frame(
  string = c(NA, letters[1:4])
)
p <- r_to_py(df)
p
#>   string
#> 0   None
#> 1      a
#> 2      b
#> 3      c
#> 4      d
p$string[0]
#> None
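Why the string 'NA' amounts to data loss can be seen on the pandas side. The sketch below uses made-up values and plain pandas (assumed installed); it does not go through reticulate.

```python
import pandas as pd

# Old behavior: the missing value arrives as the literal text "NA".
old = pd.Series(["NA", "a", "b"], dtype=object)

# New behavior: the missing value arrives as None.
new = pd.Series([None, "a", "b"], dtype=object)

print(old.isna().tolist())  # [False, False, False]
print(new.isna().tolist())  # [True, False, False]
```

With the string form, pandas can no longer distinguish a missing entry from a genuine value spelled "NA".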

t-kalinowski

Great work @dfalbel!

Can you please add a NEWS entry?

Seeing this makes me wonder if we should similarly take this opportunity to convert NA_character_ to Python None outside of a data.frame as well, e.g., r_to_py(c(NA, letters[1:3])) (or if pandas is not installed).

dfalbel · 2023-08-15T15:01:50Z

Yeah, I think we can do this in a future PR. We have to think about whether we want to be consistent with other data types, e.g., what should we do with NAs in numerics, integers and logicals when outside a pandas data frame? And what's the desired behavior when converting back to R: should we simplify the vectors?

Preserve NAs when casting R data.frames to pandas.

93a0ca9

dfalbel requested a review from t-kalinowski August 10, 2023 13:52

dfalbel force-pushed the pandas-na branch from 6a08ade to 7789dfc Compare August 10, 2023 17:53

Add pandas version guardrails.

5b43152

dfalbel force-pushed the pandas-na branch from 7789dfc to 5b43152 Compare August 10, 2023 18:26

dfalbel added 5 commits August 10, 2023 15:42

Correct minimum version number.

5cef93d

Make the behavior opt-in.

fa8e9c2

update warning message.

acd06c3

Document the option

557459a

Improve test case.

e9f8366

dfalbel mentioned this pull request Aug 15, 2023

What should NA be converted to in Python? #197

Open

Merge branch 'main' into pandas-na

b27b39a

t-kalinowski approved these changes Aug 15, 2023

Add NEWS bullet.

d09c7f8

t-kalinowski merged commit dd2529b into main Aug 15, 2023
12 checks passed

dfalbel deleted the pandas-na branch August 15, 2023 17:27

dfalbel mentioned this pull request Oct 4, 2023

pandas DataFrame with np.nan are not converted correctly #207

Open
