What should `NA` be converted to in Python? #197

terrytangyuan · 2018-03-26T20:32:50Z

Currently, NA is converted to True in Python, which doesn't make much sense. This happens because NA is a logical value in R.

> class(NA)
[1] "logical"
> r_to_py(NA)
True
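
A likely reason for the True result, offered here as an assumption about the mechanism rather than something confirmed in this thread: R represents a logical NA internally as the smallest 32-bit signed integer, and any non-zero integer is truthy in Python. A minimal sketch of that effect (the name R_NA_SENTINEL is made up for illustration):

```python
# Assumption: R stores logical/integer NA as the smallest 32-bit signed integer.
R_NA_SENTINEL = -2**31  # -2147483648

# Any non-zero integer is truthy in Python, so a naive int-to-bool cast
# turns the NA sentinel into True instead of something "missing".
na_as_bool = bool(R_NA_SENTINEL)
print(R_NA_SENTINEL, na_as_bool)
```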

Perhaps we should treat it as np.nan (similar to R's NaN) since that's the representation Python users use for missing data, e.g. missing data in pandas objects?

> r_to_py(NaN)
nan
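
For reference, NaN on the Python side is an ordinary float with unusual comparison rules, which is part of why it works as a missing-value marker for floats only. A small standard-library sketch (no pandas or NumPy involved; is_missing and its treat_inf_as_missing flag are hypothetical names used only for illustration):

```python
import math

def is_missing(value, treat_inf_as_missing=False):
    """Return True if a float should be treated as missing."""
    if math.isnan(value):
        return True
    # Optionally treat +inf and -inf as missing too.
    return treat_inf_as_missing and math.isinf(value)

nan = float("nan")
values = [1.5, nan, math.inf]

# NaN never compares equal to itself, so use math.isnan rather than ==.
print(nan == nan)
print([is_missing(v) for v in values])
print([is_missing(v, treat_inf_as_missing=True) for v in values])
```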

"NaN is the the default missing value marker for reasons of computational speed and convenience. However, if you want to consider inf and -inf to be "NA" in computations, you can set pandas.options.mode.use_inf_as_na = True" (source). So this might be something we need to keep in mind when deciding how we approach this.

Jiang-Li-backup · 2018-04-17T21:14:34Z

What I just observed: missing values in a character column were converted to the string "NA", and missing values in an integer column were converted to -2147483648.

library(tibble)

set.seed(1)
N <- 1000
df <- tibble(
  dimension1 = sample(c("I", "II", "III"), N, replace = TRUE),
  dimension2 = sample(c("A", "B", "C"), N, replace = TRUE),
  measure1 = sample(1:10, N, replace = TRUE),
  measure2 = sample(1:10, N, replace = TRUE)
)

# Replace roughly 15% of the values in each column with NA
df <- as_tibble(lapply(df, function(r) r[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(r), replace = TRUE)]))

# Then, from Python (e.g. a Python chunk in R Markdown), print the converted data frame:
print(r.df)

hilaryparker · 2019-06-24T22:40:50Z

Just confirming that missing values in integer columns are also being translated to -2147483648 in my context.
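
The value -2147483648 is -2^31, the slot R reserves for NA_integer_. Until the conversion handles this, one Python-side workaround is to map that sentinel back to None. This is only a sketch: decode_r_int_na is a hypothetical helper, and it would misread a genuine -2147483648 coming from a non-R source.

```python
R_NA_INTEGER = -2**31  # sentinel observed in this thread: -2147483648

def decode_r_int_na(values):
    """Replace the R integer NA sentinel with None, leaving other values alone."""
    return [None if v == R_NA_INTEGER else v for v in values]

measure1 = [3, -2147483648, 7, 10, -2147483648]
print(decode_r_int_na(measure1))
```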

kevinushey · 2019-06-24T23:00:15Z

Some more context:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
https://stackoverflow.com/questions/12708807/numpy-integer-nan

From what I understand, NumPy integer arrays don't have a 'missing value' number in the same way R integer vectors do, so we would have to either:

Convert from an R integer vector into a NumPy "float64" array;
Convert from an R integer vector into the Pandas "Int64" extension integer type.

I'm not sure if either of these is preferable, or even if this is something reticulate would want to do automatically. It seems instead like users should be able to request what kind of conversions should happen when going between the R and Python worlds, but I'm not sure what the interface for that should look like yet.
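
The trade-off behind the first option can be sketched without NumPy. The example below uses the standard-library array module purely as a stand-in for typed NumPy arrays (an assumption made to keep the sketch dependency-free), and r_ints_to_float64 is a hypothetical helper. An integer-typed array rejects NaN, while a float64 array accepts it and still represents every 32-bit R integer exactly.

```python
from array import array
import math

R_NA_INTEGER = -2**31  # assumed R integer NA sentinel

def r_ints_to_float64(values):
    """Convert R-style integers to a float64 array, mapping the NA sentinel to NaN."""
    return array("d", [math.nan if v == R_NA_INTEGER else float(v) for v in values])

# A C-int array has no slot for "missing": NaN is rejected outright.
try:
    array("i", [1, 2, math.nan])
    int_array_accepts_nan = True
except TypeError:
    int_array_accepts_nan = False

converted = r_ints_to_float64([5, R_NA_INTEGER, 2**31 - 1])
print(int_array_accepts_nan, list(converted))

# float64 has a 53-bit significand, so the largest R integer survives the round trip.
print(int(converted[2]) == 2**31 - 1)
```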

hilaryparker · 2019-06-24T23:04:19Z

Maybe we can leverage a dict, e.g. dict('column_x' = 'Int64')?
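
On the pandas side, a per-column mapping like this already exists in the form of DataFrame.astype, which accepts a dict of column names to dtypes. The sketch below assumes pandas is installed and that the integer column arrived as floats with NaN. It does not show a reticulate interface, which this thread leaves undecided.

```python
import pandas as pd

# Assumed input: an integer column that arrived as floats because of a missing value.
df = pd.DataFrame({"column_x": [5.0, float("nan"), 7.0], "column_y": ["a", "b", "c"]})

# Request the nullable integer dtype for one column only.
converted = df.astype({"column_x": "Int64"})
print(converted.dtypes)
print(converted)
```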

jimjam-slam · 2019-07-05T00:45:30Z

This problem also caught me by surprise, but I'm still not super familiar with the Python side of things. Would it be a bad thing for NA_integer_ and NA (logical) to be converted to NaN as sensible defaults (with the option to override), as NA_real_ is?

Obviously NaN and NA aren't the same thing, but it seems preferable to True or -2147483648. Or is NaN not available for these data types?

EDIT: never mind, it isn't available! From the Pandas docs (emphasis mine):

Because NaN is a float, a column of integers with even one missing values is cast to floating-point dtype (see Support for integer NA for more). Pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

Alternatively, the string alias dtype='Int64' (note the capital "I") can be used.

See Nullable Integer Data Type for more.
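
A minimal sketch of what explicitly requesting the nullable integer dtype looks like, assuming pandas is installed (this is an illustration, not the example from the pandas docs):

```python
import pandas as pd

# Explicitly request the nullable integer dtype; None marks the missing entry.
arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())

# The string alias "Int64" (capital "I") selects the same dtype.
ser = pd.Series([1, 2, None], dtype="Int64")

print(arr.dtype, ser.dtype)
print(ser.isna().tolist())

# Without the explicit dtype, the same data is cast to floating point.
default_dtype = pd.Series([1, 2, None]).dtype
print(default_dtype)
```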

jimjam-slam · 2019-07-05T01:48:49Z

This also seems to be problematic for dates (although it could be a related problem rather than the same one):

library(tidyverse)
library(reticulate)

d1 <- seq.Date(as.Date('2010-05-13'), as.Date('2010-05-16'), by = 'day')
d1
# [1] "2010-05-13" "2010-05-14" "2010-05-15" "2010-05-16"
r_to_py(d1)
# [datetime.date(2010, 5, 13), datetime.date(2010, 5, 14), datetime.date(2010, 5, 15), datetime.date(2010, 5, 16)]

d1[3] <- NA
d1
# [1] "2010-05-13" "2010-05-14" NA           "2010-05-16"
r_to_py(d1)
# Error in iso[[2]] : subscript out of bounds

d1[3] = as.Date(NaN)
# Error in as.Date.numeric(NaN) : 'origin' must be supplied
d1[3] = as.Date(NaN, origin = as.Date('1970-01-01'))
d1
# [1] "2010-05-13" "2010-05-14" NA           "2010-05-16"
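
Python's datetime.date likewise has no built-in "missing" value, so an NA date needs some other representation on the Python side. One option is None, sketched below with the standard library only. parse_optional_dates is a hypothetical helper, and the input format (ISO strings with None for NA) is an assumption, not a description of how reticulate actually passes dates.

```python
from datetime import date

def parse_optional_dates(iso_strings):
    """Parse ISO date strings, passing None through for missing entries."""
    return [None if s is None else date.fromisoformat(s) for s in iso_strings]

d1 = ["2010-05-13", "2010-05-14", None, "2010-05-16"]
parsed = parse_optional_dates(d1)
print(parsed)
```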

aazaff · 2019-12-19T16:13:17Z

Not to randomly drop in and necropost, but has anyone come across an answer to this problem? Surely the Python community has some way to express ternary logic.
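
Plain Python has no built-in three-valued logical type, but None is commonly used as the "unknown" value, and pandas offers a nullable "boolean" dtype. The sketch below implements Kleene-style AND and OR over True, False and None using only the standard library. The function names are made up for illustration. R's own NA follows the same truth table (for example, NA & FALSE is FALSE).

```python
def kleene_and(a, b):
    """Three-valued AND: False dominates, otherwise unknown (None) propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def kleene_or(a, b):
    """Three-valued OR: True dominates, otherwise unknown (None) propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(kleene_and(None, False), kleene_and(None, True))
print(kleene_or(None, True), kleene_or(None, False))
```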

tmbluth · 2020-01-28T22:22:29Z

I'm also getting my NAs converted to -2147483648. @Rensa, are you suggesting that I just convert all R NAs into R NaNs before using that data in reticulated Python?

jimjam-slam · 2020-01-28T22:31:22Z

@tmbluth If you can deal with having your column be floats rather than ints, that might be a good way to go (I think it's how I'm getting around it!):

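library(magrittr)  # provides the %>% pipe

c(5:8, NA, 10:12) %>%
  as.numeric() %>%            # integers become doubles so NaN can be stored
  tidyr::replace_na(NaN) %>%  # swap NA for NaN
  reticulate::r_to_py()
# [5.0, 6.0, 7.0, 8.0, nan, 10.0, 11.0, 12.0]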

The downside, of course, is that they're floats, so if you need some integer behaviours, that could introduce its own problems!

I don't know if reticulate has a way to specify conversion to other base types like Int64 (not data structures, as with data frames and arrays).

carissalow · 2021-04-13T20:15:49Z

I still see the same problem in the most recent version: pandas DataFrame string columns with missing values are converted into list columns in R.

SaintRod · 2021-11-03T22:06:09Z

Recently, I had an R data frame with two character-type columns that were completely blank. When I converted it to Python via reticulate::r_to_py(df), the missing values were re-coded as "NA" strings, or as "NaN" strings when I first used tidyr::replace_na(NaN) to convert the character-type columns from NA to NaN.

I'm uploading to a memsql database via reticulate (because it's faster) and the missing values for the string/text/character fields are populated as "NA" or "NaN" in the database. If I use odbc/DBI the data is correctly uploaded to the database - that is, NAs are coded as NULL in the database.

Any idea how to resolve this?

reticulate v: 1.20
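
Until the conversion preserves missing values, one Python-side workaround is to turn the "NA"/"NaN" strings back into None before inserting, since Python database drivers generally send None as NULL. The sketch below uses the standard-library sqlite3 module as a stand-in for the MemSQL connection (an assumption), and na_strings_to_none is a hypothetical helper. Be aware that it would also null out genuine "NA" strings, such as a country code.

```python
import sqlite3

NA_STRINGS = {"NA", "NaN"}  # string forms of missing values reported above

def na_strings_to_none(rows):
    """Replace "NA"/"NaN" strings with None in a list of row tuples."""
    return [tuple(None if v in NA_STRINGS else v for v in row) for row in rows]

rows = [("a", "NA"), ("b", "NaN"), ("c", "ok")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, note TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", na_strings_to_none(rows))

# None is stored as SQL NULL, so IS NULL finds the formerly "NA"/"NaN" rows.
null_ids = [r[0] for r in conn.execute("SELECT id FROM t WHERE note IS NULL ORDER BY id")]
print(null_ids)
conn.close()
```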

dfalbel · 2023-08-15T10:39:31Z

A possible solution for this is suggested in #1439, which also fixes the problem where missing values were re-coded to "NA" or "NaN" strings.
Also, #1428 fixes the problem where DataFrame string columns with missing values are converted into list columns.
#1427 added support for casting from Pandas nullable data types into R too.

terrytangyuan mentioned this issue Mar 29, 2018

Add py_func() to wrap R function in a Python function with correct signature #195

Merged

terrytangyuan mentioned this issue Apr 7, 2018

pandas DataFrame with np.nan are not converted correctly #207

Open

dfalbel self-assigned this Aug 15, 2023

dfalbel closed this as completed Aug 15, 2023

dfalbel reopened this Aug 15, 2023

This was referenced Aug 15, 2023

Preserve NAs when casting R data.frames to pandas. #1439

Merged

NA casts outside of data.frames #1446

Open

LucieContamin mentioned this issue Mar 1, 2024

create_task_id() for horizon returns an error if required and optional set to NULL Infectious-Disease-Modeling-Hubs/hubAdmin#4

Closed
