Fail to convert pandas dataframe: integer <some very large integer> does not fit 'int' #598

Open
lgautier opened this issue Sep 20, 2019 · 7 comments
@lgautier
Member

Original report by Brian Lie (Bitbucket: [Brian Lie](https://bitbucket.org/Brian Lie)).


Hello, I found that pandas dataframes containing very large integers can’t be converted to R objects yet.

We need to pass a pandas dataframe to R inside a single Jupyter notebook, and would like an alternative to saving the dataframe to a file and loading it in R later (using IPython’s R cell magics).

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

pd_df = pd.DataFrame({'int_values': [1,2,30123456789, 4],'str_values': ['abc', 'def', 'ghi', 'jkl']})

print(pd_df)

with localconverter(ro.default_converter + pandas2ri.converter):
    r_from_pd_df = ro.conversion.py2rpy(pd_df)

print(r_from_pd_df)

When the sample above is executed, a warning similar to the one below appears, followed by an exception:

/home/user/playground/rpy2/rpy2-3.1.0/rpy2/robjects/pandas2ri.py:60: UserWarning: Error while trying to convert the column "int_values". Fall back to string conversion. The error is: integer 30123456789 does not fit 'int'

The error occurs on the line where the conversion is done.

Here’s the version information:

  • rpy2 3.1.0
  • Python 3.5.3
  • R 3.4.4
  • Ubuntu 18.04

Full stack trace:

/home/user/playground/rpy2/rpy2-3.1.0/rpy2/robjects/pandas2ri.py:60: UserWarning: Error while trying to convert the column "int_values". Fall back to string conversion. The error is: integer 30123456789 does not fit 'int'
  % (name, str(e)))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_object(cls, obj)
    368             mv = memoryview(obj)
--> 369             res = cls.from_memoryview(mv)
    370         except (TypeError, ValueError):

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_memoryview(cls, mview)
    345             )
--> 346             raise ValueError(msg)
    347         r_vector = None

ValueError: Incompatible C type sizes. The R array type is 4 bytes while the Python array type is 8 bytes.

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/pandas2ri.py in py2rpy_pandasdataframe(obj)
     54         try:
---> 55             od[name] = conversion.py2rpy(values)
     56         except Exception as e:

~/miniconda3/envs/jupyter/lib/python3.5/functools.py in wrapper(*args, **kw)
    744     def wrapper(*args, **kw):
--> 745         return dispatch(args[0].__class__)(*args, **kw)
    746 

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/pandas2ri.py in py2rpy_pandasseries(obj)
    155 
--> 156         res = func(obj)
    157         if len(obj.shape) == 1:

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/numpy2ri.py in numpy2rpy(o)
     86     if o.dtype.kind in _kinds:
---> 87         res = _numpyarray_to_r(o, _kinds[o.dtype.kind])
     88     # R does not support unsigned types:

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/numpy2ri.py in _numpyarray_to_r(a, func)
     55     # "F" means "use column-major order"
---> 56     vec = func(numpy.ravel(a, order='F'))
     57     # TODO: no dimnames ?

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/vectors.py in __init__(self, obj)
    412     def __init__(self, obj):
--> 413         super().__init__(obj)
    414         self._add_rops()

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in __init__(self, obj)
    287         elif isinstance(obj, collections.abc.Sized):
--> 288             super().__init__(type(self).from_object(obj).__sexp__)
    289         else:

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_object(cls, obj)
    371             try:
--> 372                 res = cls.from_iterable(obj)
    373             except ValueError:

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_iterable(cls, iterable, populate_func)
    318                 cls._populate_r_vector(iterable,
--> 319                                        r_vector)
    320             else:

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in _populate_r_vector(cls, iterable, r_vector)
    301                                   cls._R_SET_VECTOR_ELT,
--> 302                                   cls._CAST_IN)
    303 

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in _populate_r_vector(iterable, r_vector, set_elt, cast_value)
    238     for i, v in enumerate(iterable):
--> 239         set_elt(r_vector, i, cast_value(v))
    240 

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/openrlib.py in _set_integer_elt_fallback(vec, i, value)
     99 def _set_integer_elt_fallback(vec, i: int, value):
--> 100     INTEGER(vec)[i] = value
    101 

OverflowError: integer 30123456789 does not fit 'int'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_object(cls, obj)
    367         try:
--> 368             mv = memoryview(obj)
    369             res = cls.from_memoryview(mv)

TypeError: memoryview: a bytes-like object is required, not 'Series'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<ipython-input-14-198b4de8bd29> in <module>
     14 
     15 with localconverter(ro.default_converter + pandas2ri.converter):
---> 16     r_from_pd_df = ro.conversion.py2rpy(pd_df)
     17 
     18 print(r_from_pd_df)

~/miniconda3/envs/jupyter/lib/python3.5/functools.py in wrapper(*args, **kw)
    743 
    744     def wrapper(*args, **kw):
--> 745         return dispatch(args[0].__class__)(*args, **kw)
    746 
    747     registry[object] = func

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/pandas2ri.py in py2rpy_pandasdataframe(obj)
     59                           'The error is: %s'
     60                           % (name, str(e)))
---> 61             od[name] = StrVector(values)
     62 
     63     return DataFrame(od)

~/playground/rpy2/rpy2-3.1.0/rpy2/robjects/vectors.py in __init__(self, obj)
    382 
    383     def __init__(self, obj):
--> 384         super().__init__(obj)
    385         self._add_rops()
    386 

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in __init__(self, obj)
    286             super().__init__(obj)
    287         elif isinstance(obj, collections.abc.Sized):
--> 288             super().__init__(type(self).from_object(obj).__sexp__)
    289         else:
    290             raise TypeError('The constructor must be called '

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_object(cls, obj)
    370         except (TypeError, ValueError):
    371             try:
--> 372                 res = cls.from_iterable(obj)
    373             except ValueError:
    374                 msg = ('The class methods from_memoryview() and '

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     26 def _cdata_res_to_rinterface(function):
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType
     30         return _cdata_to_rinterface(cdata)

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in from_iterable(cls, iterable, populate_func)
    317             if populate_func is None:
    318                 cls._populate_r_vector(iterable,
--> 319                                        r_vector)
    320             else:
    321                 populate_func(iterable, r_vector)

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in _populate_r_vector(cls, iterable, r_vector)
    300                                   r_vector,
    301                                   cls._R_SET_VECTOR_ELT,
--> 302                                   cls._CAST_IN)
    303 
    304     @classmethod

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in _populate_r_vector(iterable, r_vector, set_elt, cast_value)
    237 def _populate_r_vector(iterable, r_vector, set_elt, cast_value):
    238     for i, v in enumerate(iterable):
--> 239         set_elt(r_vector, i, cast_value(v))
    240 
    241 

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/sexp.py in _as_charsxp_cdata(x)
    430         return x.__sexp__._cdata
    431     else:
--> 432         return conversion._str_to_charsxp(x)
    433 
    434 

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/conversion.py in _str_to_charsxp(val)
    118         s = rlib.R_NaString
    119     else:
--> 120         cchar = _str_to_cchar(val)
    121         s = rlib.Rf_mkCharCE(cchar, _CE_UTF8)
    122     return s

~/playground/rpy2/rpy2-3.1.0/rpy2/rinterface_lib/conversion.py in _str_to_cchar(s, encoding)
     97 def _str_to_cchar(s, encoding: str = 'utf-8'):
     98     # TODO: use isStrinb and installTrChar
---> 99     b = s.encode(encoding)
    100     return ffi.new('char[]', b)
    101 

AttributeError: 'int' object has no attribute 'encode'

@lgautier
Member Author

Original comment by Brian Lie (Bitbucket: [Brian Lie](https://bitbucket.org/Brian Lie)).


By the way, if you check the type of the int_values column of the pandas dataframe above (e.g. type(pd_df['int_values'][2])), it shows up as numpy.int64.

So is this an int64 problem?

@lgautier
Member Author

Original comment by Laurent Gautier (Bitbucket: lgautier, GitHub: lgautier).


Yes. R does not natively support int64 values (though it does support vectors whose length exceeds what a 32-bit index can address).
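
To make the limit concrete: R's native integer type is a 32-bit signed integer, and the smallest 32-bit value is reserved for NA_integer_, so the usable range is -2147483647 to 2147483647. Below is a minimal Python sketch of a check one could run before conversion. The helper name fits_r_integer is made up for illustration and is not part of rpy2.

```python
# Largest value an R integer can hold (.Machine$integer.max in R).
R_INT_MAX = 2**31 - 1
# R reserves the smallest 32-bit value (-2**31) for NA_integer_,
# so the smallest usable integer is one above it.
R_INT_MIN = -R_INT_MAX


def fits_r_integer(value: int) -> bool:
    """Return True if `value` can be stored in an R integer vector."""
    return R_INT_MIN <= value <= R_INT_MAX


# The column from the report: only the third value is out of range.
values = [1, 2, 30123456789, 4]
out_of_range = [v for v in values if not fits_r_integer(v)]
print(out_of_range)  # [30123456789]
```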

@lgautier lgautier transferred this issue from another repository Dec 27, 2019
@AnselmC

AnselmC commented Jun 18, 2020

Hi,
is there a solution/work-around for this?
Thanks!

@artur-ba

Hi,
Any news on this issue? This week my team encountered this problem.
Our environment:
rpy2 3.4.5
Python 3.9.7
R 4.1.2

@AnselmC

AnselmC commented Feb 27, 2022

@artur-ba my workaround is to convert the dataframe to a CSV string and then load that string via an R function:

import rpy2.robjects as robjects

csv_text = df.to_csv(index=False)
# Load the helper defined in r_utils.R into R's global environment.
robjects.r("source('r_utils.R')")
get_dataframe_from_csv_text = robjects.globalenv["get_dataframe_from_csv_text"]
r_df = get_dataframe_from_csv_text(csv_text=csv_text)

where r_utils.R contains:

get_dataframe_from_csv_text <- function(csv_text) {
  df <- read.csv(text = csv_text, check.names = FALSE)
  return(df)
}

@lgautier lgautier self-assigned this Feb 27, 2022
@lgautier
Member Author

@artur-ba - I am looking at whether I can fix this for release 3.5.0.

@AnselmC - serializing from Python to CSV and then deserializing in R seems quite inefficient. rpy2-arrow would be better (https://github.com/rpy2/rpy2-arrow).

@lgautier
Member Author

I just checked with R 4.1. It still can't handle such integers.

> as.integer(30123456789)
[1] NA
Warning message:
NAs introduced by coercion to integer range 

64-bit floats work though:

pd_df = pd.DataFrame({
    'int_values': [1.0, 2, 30123456789, 4],
    'str_values': ['abc', 'def', 'ghi', 'jkl']})
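
As a quick check of why the float route works for this value: a 64-bit float has a 53-bit significand, so every integer with magnitude up to 2**53 is represented exactly, and 30123456789 is far below that. Here is a small sketch in plain Python, whose float is a 64-bit double on the usual platforms. The helper name is hypothetical.

```python
# Integers up to this magnitude are all exactly representable
# in an IEEE 754 double (53-bit significand).
MAX_EXACT_INT = 2**53


def survives_float_cast(value: int) -> bool:
    """Return True if converting `value` to a float loses nothing."""
    return float(value) == value


print(survives_float_cast(30123456789))        # True
print(survives_float_cast(MAX_EXACT_INT))      # True
print(survives_float_cast(MAX_EXACT_INT + 1))  # False
```

Beyond 2**53 a cast can silently change the value, which is the main risk of doing it without telling the user.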

The options would be:
a. silently cast to float
b. leave the decision to the Python code: cast to float, use the R package bit64 to convert, or treat this as an input error and handle it.
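
Option b could look roughly like the sketch below on the Python side. This is only an illustration of the three choices, not rpy2 code: the function name, the on_overflow parameter, and the policy names are all hypothetical, and the "string" policy assumes the R side would parse the strings with bit64::as.integer64().

```python
R_INT_MAX = 2**31 - 1   # .Machine$integer.max in R; -2**31 is NA_integer_
MAX_EXACT_INT = 2**53   # integers up to this magnitude are exact as doubles


def prepare_ints_for_r(values, on_overflow="raise"):
    """Return a list that can be converted to R without overflow.

    on_overflow (hypothetical policy names):
      "raise"  - treat out-of-range values as an input error
      "float"  - cast to float, refusing magnitudes beyond 2**53
      "string" - emit decimal strings, e.g. for bit64::as.integer64() in R
    """
    values = list(values)
    too_big = [v for v in values if abs(v) > R_INT_MAX]
    if not too_big:
        return values  # everything fits a 32-bit R integer
    if on_overflow == "float":
        lossy = [v for v in too_big if abs(v) > MAX_EXACT_INT]
        if lossy:
            raise OverflowError(f"may not be exact as float64: {lossy}")
        return [float(v) for v in values]
    if on_overflow == "string":
        return [str(v) for v in values]
    if on_overflow == "raise":
        raise OverflowError(f"does not fit an R integer: {too_big}")
    raise ValueError(f"unknown on_overflow policy: {on_overflow!r}")


column = [1, 2, 30123456789, 4]
print(prepare_ints_for_r(column, on_overflow="float"))
# [1.0, 2.0, 30123456789.0, 4.0]
print(prepare_ints_for_r(column, on_overflow="string"))
# ['1', '2', '30123456789', '4']
```

For a pandas column, the "float" policy corresponds to calling pd_df['int_values'].astype('float64') before the conversion.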
