# Moving pandas DataFrames faster between Python and R

We create a test `pandas.DataFrame`. The size is set to show
a noticeable effect without waiting too long for the
slowest conversion on the laptop the notebook ran on. Feel free
to change the variable `_N` to whatever best suits your hardware
and your patience.

In [1]:
import pandas as pd
# Number of rows in the DataFrame.
_N = 500000
pd_dataf = pd.DataFrame({'x': range(_N),
                         'y': ['abc', 'def'] * (_N//2)})

Next we load the IPython/Jupyter extension in `rpy2` to communicate with R in a (Python) notebook.

In [2]:
%load_ext rpy2.ipython

With the extension loaded, the `DataFrame` can be imported in an R cell (declared with `%%R`) using the argument `-i`. This is a reasonably sized data table, and on the machine where the notebook was written it takes
a few seconds for the conversion system to create a copy of it in R.

In [3]:
%%time
%%R -i pd_dataf
print(head(pd_dataf))
rm(pd_dataf)

  x   y
0 0 abc
1 1 def
2 2 abc
3 3 def
4 4 abc
5 5 def
CPU times: user 7.13 s, sys: 119 ms, total: 7.25 s
Wall time: 7.26 s


The conversion of a `pandas.DataFrame` can be accelerated by using Apache Arrow as an intermediate step.
The package `pyarrow` uses efficient compiled code to go from a `pandas.DataFrame` to an Arrow data
structure, and the R package `arrow` can go from an Arrow data structure to an R `data.frame`.

The package `rpy2_arrow` can help manage the conversion between Python wrappers to Arrow data structures (Python package `pyarrow`) and R wrappers to Arrow data structures (R package `arrow`). Creating a custom converter for `rpy2` is done in a few lines of code.

In [4]:
import pyarrow
import rpy2.robjects.conversion
import rpy2.rinterface
import rpy2_arrow.pyarrow_rarrow as pyra

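# Custom converter: pandas.DataFrame -> Arrow (Python) -> Arrow (R) -> R data.frame.
conv = rpy2.robjects.conversion.Converter('Pandas to data.frame')

@conv.py2rpy.register(pd.DataFrame)
def py2rpy_pandas(dataf):
    # pandas.DataFrame -> pyarrow.Table.
    pa_tbl = pyarrow.Table.from_pandas(dataf)
    # pyarrow.Table -> Table wrapped by the R package arrow.
    ra_tbl = pyra.converter.py2rpy(pa_tbl)
    # R arrow Table -> R data.frame, using R's as.data.frame().
    return rpy2.rinterface.baseenv['as.data.frame'](ra_tbl)

# Add our rule on top of the converter used by the %%R magic.
conv = rpy2.ipython.rmagic.converter + conv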

Our custom converter `conv` can be specified as a parameter to `%%R`:

In [5]:
%%time
%%R -i pd_dataf -c conv
print(head(pd_dataf))
rm(pd_dataf)

  x   y
1 0 abc
2 1 def
3 2 abc
4 3 def
5 4 abc
6 5 def
CPU times: user 5.6 s, sys: 75.7 ms, total: 5.68 s
Wall time: 5.69 s


The conversion is faster, but much less so than one could have hoped. On the machine this notebook was written on, the wall time drops from 7.26 s to 5.69 s, a reduction of barely 22%.

It is also possible to only convert to an Arrow data structure. The converter is then one line shorter:

In [6]:
conv2 = rpy2.robjects.conversion.Converter('Pandas to pyarrow')

@conv2.py2rpy.register(pd.DataFrame)
def py2rpy_pandas(dataf):
    pa_tbl = pyarrow.Table.from_pandas(dataf)
    return pyra.converter.py2rpy(pa_tbl)

conv2 = rpy2.ipython.rmagic.converter + conv2

In [7]:
%%time
%%R -i pd_dataf -c conv2
print(head(pd_dataf))
rm(pd_dataf)

Table
6 rows x 2 columns
$x <int64>
$y <string>

See $metadata for additional Schema metadata
CPU times: user 31.8 ms, sys: 3.63 ms, total: 35.5 ms
Wall time: 34.6 ms


This time the conversion is much faster: the whole cell runs in about 35 ms of wall time, down from several seconds.
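
To put the three measurements side by side, here is a small sketch of the arithmetic using the wall times reported by `%%time` above. The helper names `reduction` and `speedup` are illustrative, and the figures are specific to the machine the notebook ran on; each wall time covers the whole cell, including the call to `print(head(...))`.

```python
# Wall times reported by %%time in this notebook, in seconds.
wall_default = 7.26       # default conversion (cell In [3])
wall_arrow_df = 5.69      # through Arrow, then as.data.frame (cell In [5])
wall_arrow_only = 0.0346  # stopping at the Arrow table (cell In [7])

def reduction(baseline, new):
    """Fraction of the baseline wall time that is saved."""
    return 1 - new / baseline

def speedup(baseline, new):
    """How many times faster the new timing is than the baseline."""
    return baseline / new

print(f"Arrow + as.data.frame: "
      f"{reduction(wall_default, wall_arrow_df):.0%} less wall time")
print(f"Arrow only: "
      f"{speedup(wall_default, wall_arrow_only):.0f}x faster than the default")
```

With these numbers, going through Arrow and then to a `data.frame` saves about 22% of the wall time, while stopping at the Arrow table is roughly 210 times faster than the default conversion.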

The R package `arrow` implements methods for its wrappers for Arrow data structures to make their behavior close to that of `data.frame` objects.
There will be many situations where this is sufficient to work with the data table in R, while benefiting from the very significant speed gain.
For example, with the R package `dplyr`:

In [8]:
%%R
suppressMessages(require(dplyr))

In [9]:
%%time
%%R -i pd_dataf -c conv2

pd_dataf %>%
  group_by(y) %>%
  summarize(n = length(x))

# A tibble: 2 x 2
  y          n
  <chr>  <int>
1 abc   250000
2 def   250000
CPU times: user 195 ms, sys: 20.7 ms, total: 215 ms
Wall time: 220 ms
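
As a sanity check on the counts above, the same aggregation can be sketched on the Python side with pandas alone. This is only a cross-check of the result, not part of the conversion benchmark; the name `counts` is illustrative.

```python
import pandas as pd

# Same test DataFrame as at the top of the notebook.
_N = 500000
pd_dataf = pd.DataFrame({'x': range(_N),
                         'y': ['abc', 'def'] * (_N // 2)})

# Count rows per value of 'y', mirroring
# group_by(y) %>% summarize(n = length(x)) in the R cell.
counts = pd_dataf.groupby('y').size()
print(counts)
```

Both groups should contain 250000 rows, matching the tibble printed by `dplyr`.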
