What should NA be converted to in Python? #197

Open
terrytangyuan opened this issue Mar 26, 2018 · 12 comments

@terrytangyuan
Contributor

terrytangyuan commented Mar 26, 2018

Currently, NA is converted to True in Python, which doesn't make much sense. This happens because NA is a logical value in R.

> class(NA)
[1] "logical"
> r_to_py(NA)
True

Perhaps we should treat it as np.nan (similar to R's NaN) since that's the representation Python users use for missing data, e.g. missing data in pandas objects?

> r_to_py(NaN)
nan

"NaN is the default missing value marker for reasons of computational speed and convenience. However, if you want to consider inf and -inf to be "NA" in computations, you can set pandas.options.mode.use_inf_as_na = True" (source). So this might be something we need to keep in mind when deciding how we approach this.

@Jiang-Li-backup

What I just observed: missing values in a character column were converted to the string "NA", and missing values in an integer column were converted to -2147483648.

library(tibble)

set.seed(1)
N <- 1000
df <- tibble(
  dimension1 = sample(c("I", "II", "III"), N, replace = TRUE),
  dimension2 = sample(c("A", "B", "C"), N, replace = TRUE),
  measure1 = sample(1:10, N, replace = TRUE),
  measure2 = sample(1:10, N, replace = TRUE)
)

# Replace roughly 15% of the values in each column with NA
df <- as_tibble(lapply(df, function(r) r[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(r), replace = TRUE)]))

# Presumably run from a Python chunk, where r.df refers to the R object df
print(r.df)

@hilaryparker

Just confirming that missing values in integer columns are also being translated to -2147483648 in my context.
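
For readers wondering where -2147483648 comes from: R stores NA_integer_ as the smallest 32-bit signed integer, so the value seen in Python is that sentinel passing through unchanged. A minimal sketch using only the standard library; the names R_NA_INTEGER and looks_like_r_na are made up for illustration and are not part of reticulate:

```python
import struct

# R represents NA_integer_ as the smallest 32-bit signed integer (INT_MIN).
R_NA_INTEGER = -(2 ** 31)


def looks_like_r_na(value: int) -> bool:
    """Return True if value equals R's integer NA sentinel (hypothetical helper)."""
    return value == R_NA_INTEGER


# Pack the sentinel as a little-endian 4-byte signed int to show it fits in int32.
packed = struct.pack("<i", R_NA_INTEGER)

print(R_NA_INTEGER)                  # -2147483648
print(packed.hex())                  # 00000080
print(looks_like_r_na(-2147483648))  # True
```

A check like this cannot tell a real -2147483648 from a missing value, which is why a sentinel comparison is only a stopgap.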

@kevinushey
Collaborator

Some more context:

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
https://stackoverflow.com/questions/12708807/numpy-integer-nan

From what I understand, NumPy integer arrays don't have a 'missing value' number in the same way R integer vectors do, so we would have to either:

  1. Convert from an R integer vector into a NumPy "float64" array;
  2. Convert from an R integer vector into the Pandas "Int64" extension integer type.

I'm not sure if either of these is preferable, or even if this is something reticulate would want to do automatically. It seems instead like users should be able to request what kind of conversions should happen when going between the R and Python worlds, but I'm not sure what the interface for that should look like yet.
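
The two options above can be sketched on the Python side as follows. This is only an illustration of what each target type looks like, assuming NumPy and pandas 1.0 or later are installed; it is not how reticulate performs the conversion, and R_NA_INTEGER is a name chosen here for the sentinel value reported earlier in the thread.

```python
import numpy as np
import pandas as pd

# Sentinel that R uses for NA_integer_ (the value reported in this thread).
R_NA_INTEGER = -(2 ** 31)

# An integer array as it currently arrives in Python, with the sentinel in place of NA.
raw = np.array([5, 6, R_NA_INTEGER, 8], dtype=np.int32)
missing = raw == R_NA_INTEGER

# Option 1: a NumPy float64 array, with NaN marking the missing entry.
as_float = raw.astype(np.float64)
as_float[missing] = np.nan

# Option 2: the pandas nullable "Int64" extension type, with pd.NA marking it.
as_int64 = pd.array(raw, dtype="Int64")
as_int64[missing] = pd.NA

print(as_float.dtype, as_float)
print(as_int64.dtype, as_int64)
```

Option 1 loses the integer type, while option 2 keeps it but ties the result to pandas rather than plain NumPy.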

@hilaryparker

hilaryparker commented Jun 24, 2019

Maybe we can leverage a dict, e.g. dict('column_x' = 'Int64')

@jimjam-slam

jimjam-slam commented Jul 5, 2019

This problem also caught me by surprise, but I'm still not super familiar with the Python side of things. Would it be a bad thing for NA_integer_ and NA (logical) to be converted to NaN as sensible defaults (with the option to override), as NA_real_ is?

Obviously NaN and NA aren't the same thing, but it seems preferable to True or -2147483648. Or is NaN not available for these data types?

EDIT: nevermind, it isn't available! From the Pandas docs (emphasis mine):

Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support for integer NA for more). Pandas provides a nullable integer array, which can be used by explicitly requesting the dtype:

Alternatively, the string alias dtype='Int64' (note the capital "I") can be used.

See Nullable Integer Data Type for more.

@jimjam-slam

jimjam-slam commented Jul 5, 2019

This also seems to be problematic for dates (although it could be a related problem rather than the same one):

library(tidyverse)
library(reticulate)

d1 <- seq.Date(as.Date('2010-05-13'), as.Date('2010-05-16'), by = 'day')
d1
# [1] "2010-05-13" "2010-05-14" "2010-05-15" "2010-05-16"
r_to_py(d1)
# [datetime.date(2010, 5, 13), datetime.date(2010, 5, 14), datetime.date(2010, 5, 15), datetime.date(2010, 5, 16)]

d1[3] <- NA
d1
# [1] "2010-05-13" "2010-05-14" NA           "2010-05-16"
r_to_py(d1)
# Error in iso[[2]] : subscript out of bounds

d1[3] = as.Date(NaN)
# Error in as.Date.numeric(NaN) : 'origin' must be supplied
d1[3] = as.Date(NaN, origin = as.Date('1970-01-01'))
d1
# [1] "2010-05-13" "2010-05-14" NA           "2010-05-16"

@aazaff

aazaff commented Dec 19, 2019

Not to randomly drop in and necropost, but has anyone come across an answer to this problem? Surely the python community has some way to express ternary logic.
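
On the ternary logic question: pandas 1.0 (released after this comment) introduced pd.NA and a nullable "boolean" dtype, both of which follow three-valued (Kleene) logic. A short sketch, assuming pandas 1.0 or later; this shows pandas behaviour only and says nothing about what reticulate converts NA to:

```python
import pandas as pd

# pd.NA follows three-valued logic: the result is only missing
# when the other operand does not already decide the outcome.
print(pd.NA | True)   # True
print(pd.NA & False)  # False
print(pd.NA & True)   # <NA>
print(pd.NA | False)  # <NA>

# The nullable "boolean" dtype applies the same rules element-wise.
flags = pd.array([True, False, None], dtype="boolean")
and_true = flags & True
or_true = flags | True
print(and_true)
print(or_true)
```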

@tmbluth

tmbluth commented Jan 28, 2020

I'm also getting my NAs converted to -2147483648. @Rensa are you suggesting that I just convert all R NAs into R NaNs before using that data in reticulated Python?

@jimjam-slam

jimjam-slam commented Jan 28, 2020

@tmbluth If you can deal with having your column be floats rather than ints, that might be a good way to go (I think it's how I'm getting around it!):

library(magrittr)  # provides %>%

c(5:8, NA, 10:12) %>%
  as.numeric() %>%
  tidyr::replace_na(NaN) %>%
  reticulate::r_to_py()
# [5.0, 6.0, 7.0, 8.0, nan, 10.0, 11.0, 12.0]

The downside, of course, is that they're floats, so if you need some integer behaviours, that could introduce its own problems!

I don't know if reticulate has a way to specify conversion to other base types like Int64 (not data structures, as with data frames and arrays).

@carissalow

I still see the same problem in the most recent version. Pandas df string columns with missing values are converted into list columns in R.

@SaintRod

SaintRod commented Nov 3, 2021

Recently, I had an R data frame with two character-type columns that were completely blank. When I converted it to Python via reticulate::r_to_py(df), the missing values were re-coded as "NA" strings, or as "NaN" strings when I first used tidyr::replace_na(NaN) to convert the NAs in the character-type columns to NaN.

I'm uploading to a memsql database via reticulate (because it's faster) and the missing values for the string/text/character fields are populated as "NA" or "NaN" in the database. If I use odbc/DBI the data is correctly uploaded to the database - that is, NAs are coded as NULL in the database.

Any idea how to resolve this?

reticulate v: 1.20
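
One possible stopgap on the Python side is to turn the sentinel strings back into None before uploading, since database drivers generally map None to NULL. This is a rough sketch under that assumption, not a reticulate feature; SENTINELS and strings_to_none are hypothetical names, and the approach would also wipe out any genuine "NA" or "NaN" text in those columns.

```python
import pandas as pd

# Hypothetical set of strings that stand in for missing values after conversion.
SENTINELS = {"NA", "NaN"}


def strings_to_none(frame, columns):
    """Return a copy of frame where sentinel strings in the given columns become None."""
    out = frame.copy()
    for col in columns:
        values = [None if v in SENTINELS else v for v in out[col]]
        # dtype=object keeps None as None instead of coercing it to NaN.
        out[col] = pd.Series(values, index=out.index, dtype=object)
    return out


df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "NA", "NaN"]})
cleaned = strings_to_none(df, ["label"])
print(cleaned["label"].tolist())  # ['a', None, None]
```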

@dfalbel
Member

dfalbel commented Aug 15, 2023

A possible solution for this is suggested in #1439, which also fixes the case where missing values were re-coded to "NA" or "NaN" strings.
Also, #1428 fixes Pandas df string columns with missing values being converted into list columns in R.
#1427 added support for casting from Pandas nullable data types into R too.
