What should NA be converted to in Python? #197
Comments
What I just observed: missing values in a character column were converted to the string "NA", and missing values in an integer column were converted to -2147483648.
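For context, -2147483648 is the smallest 32-bit signed integer, which is the bit pattern R uses internally for NA_integer_. The sketch below shows one way the sentinel could be decoded on the Python side once it has leaked through. The helper name `decode_r_integers` is hypothetical and not part of reticulate, and the approach assumes the value never occurs as genuine data.

```python
# R stores NA_integer_ as the smallest 32-bit signed integer.
R_NA_INTEGER = -(2 ** 31)  # -2147483648


def decode_r_integers(values):
    """Hypothetical helper: replace R's integer NA sentinel with None."""
    return [None if v == R_NA_INTEGER else v for v in values]


# An integer column as it might arrive from R with one missing value.
column = [5, 6, R_NA_INTEGER, 8]
decoded = decode_r_integers(column)
print(decoded)
```

Mapping to None keeps the remaining values as integers, at the cost of a column that is no longer a plain numeric array.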
Just confirming that missing values in integer columns are also being translated to |
Some more context: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html From what I understand, NumPy integer arrays don't have a 'missing value' number in the same way R integer vectors do, so we would have to either:
I'm not sure if either of these is preferable, or even if this is something reticulate would want to do automatically. It seems instead like users should be able to request what kind of conversions should happen when going between the R and Python worlds, but I'm not sure what the interface for that should look like yet. |
maybe we can leverage |
This problem also caught me by surprise, but I'm still not super familiar with the Python side of things. Would it be a bad thing for Obviously EDIT: nevermind, it isn't available! From the Pandas docs (emphasis mine):
This also seems to be problematic for dates (although it could be a related problem rather than the same one):
library(tidyverse)
library(reticulate)
d1 <- seq.Date(as.Date('2010-05-13'), as.Date('2010-05-16'), by = 'day')
d1
# [1] "2010-05-13" "2010-05-14" "2010-05-15" "2010-05-16"
r_to_py(d1)
# [datetime.date(2010, 5, 13), datetime.date(2010, 5, 14), datetime.date(2010, 5, 15), datetime.date(2010, 5, 16)]
d1[3] <- NA
d1
# [1] "2010-05-13" "2010-05-14" NA "2010-05-16"
r_to_py(d1)
# Error in iso[[2]] : subscript out of bounds
d1[3] = as.Date(NaN)
# Error in as.Date.numeric(NaN) : 'origin' must be supplied
d1[3] = as.Date(NaN, origin = as.Date('1970-01-01'))
d1
# [1] "2010-05-13" "2010-05-14" NA "2010-05-16"
Not to randomly drop in and necropost, but has anyone come across an answer to this problem? Surely the Python community has some way to express ternary logic.
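On the ternary logic question: plain Python has no built-in three-valued boolean, but None is often used as the "unknown" value. The sketch below is a hand-rolled illustration of Kleene logic, which matches how R evaluates `NA & FALSE` and `NA | TRUE`. The function names are hypothetical. As far as I know, pandas' `pd.NA` follows the same rules, but nothing here depends on pandas.

```python
def kleene_and(a, b):
    """Three-valued AND where None stands for 'unknown' (like R's NA)."""
    if a is False or b is False:
        return False  # a definite False decides the result
    if a is None or b is None:
        return None  # otherwise any unknown keeps the result unknown
    return True


def kleene_or(a, b):
    """Three-valued OR where None stands for 'unknown' (like R's NA)."""
    if a is True or b is True:
        return True  # a definite True decides the result
    if a is None or b is None:
        return None
    return False


print(kleene_and(None, False), kleene_and(None, True))
print(kleene_or(None, True), kleene_or(None, False))
```

Note that Python's own `and` and `or` operators do not behave this way with None, since None is simply treated as falsy.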
My NAs are also being converted to -2147483648. @Rensa are you suggesting that I just convert all R NAs into R NaNs before using that data in reticulated Python?
@tmbluth If you can deal with having your column be floats rather than ints, that might be a good way to go (I think it's how I'm getting around it!):
c(5:8, NA, 10:12) %>%
as.numeric() %>%
tidyr::replace_na(NaN) %>%
reticulate::r_to_py()
# [5.0, 6.0, 7.0, 8.0, nan, 10.0, 11.0, 12.0]
The downside, of course, is that they're floats, so if you need some integer behaviours, that could introduce its own problems! I don't know if
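To make the float trade-off concrete, here is a small Python-side sketch of what working with the list above looks like. It assumes the data arrived as a plain list of floats, as in the output shown, and the helper `to_optional_int` is hypothetical.

```python
import math

# The list as it arrives from r_to_py() in the example above.
values = [5.0, 6.0, 7.0, 8.0, math.nan, 10.0, 11.0, 12.0]

# NaN never compares equal to anything, including itself,
# so missing values have to be detected with math.isnan().
missing = [math.isnan(v) for v in values]


def to_optional_int(v):
    """Hypothetical helper: recover an int, or None where the value is NaN."""
    # int(math.nan) raises ValueError, so NaN must be handled first.
    return None if math.isnan(v) else int(v)


restored = [to_optional_int(v) for v in values]
print(missing)
print(restored)
```

Integers in the 32-bit range are represented exactly as doubles, so the round trip back to int is lossless here. The awkward part is remembering to check for NaN before every integer operation.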
I still see the same problem in the most recent version: pandas DataFrame string columns with missing values are converted into list columns in R.
Recently, I had an R data frame with two character-type columns that were completely blank. When I converted it to Python via reticulate::r_to_py(df), the missing values were recoded as "NA" strings, or as "NaN" strings when I first used tidyr::replace_na(NaN) to convert the character-type columns from NA to NaN. I'm uploading to a memsql database via reticulate (because it's faster), and the missing values for the string/text/character fields are populated as "NA" or "NaN" in the database. If I use odbc/DBI, the data is uploaded correctly, that is, NAs are coded as NULL in the database. Any idea how to resolve this? reticulate v: 1.20
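One possible workaround, sketched under stated assumptions rather than as a reticulate feature: replace the "NA" and "NaN" marker strings with None on the Python side before uploading, since Python DB-API drivers generally send None as SQL NULL. The helper `null_out_missing` is hypothetical, and an in-memory sqlite3 database stands in for memsql purely for illustration. The obvious caveat is that a genuine "NA" string in the data would also be nulled out.

```python
import sqlite3

# Strings that the R-to-Python conversion produced for missing values.
MISSING_MARKERS = {"NA", "NaN"}


def null_out_missing(rows, markers=MISSING_MARKERS):
    """Hypothetical helper: turn marker strings into None so they load as NULL."""
    return [
        tuple(None if isinstance(cell, str) and cell in markers else cell
              for cell in row)
        for row in rows
    ]


rows = [(1, "alice"), (2, "NA"), (3, "NaN")]
cleaned = null_out_missing(rows)

# sqlite3 is only a stand-in here; the original poster uses memsql.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", cleaned)
null_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE name IS NULL"
).fetchone()[0]
print(cleaned, null_count)
```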
Currently NA is converted to Python as True, which doesn't make too much sense. This is because NA is logical in R.
Perhaps we should treat it as np.nan (similar to R's NaN), since that's the representation Python users use for missing data, e.g. missing data in pandas objects?
"NaN is the default missing value marker for reasons of computational speed and convenience. However, if you want to consider inf and -inf to be "NA" in computations, you can set pandas.options.mode.use_inf_as_na = True" (source).
So this might be something we need to keep in mind when deciding how we approach this.