[DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy #1858

dmadeka · 2018-04-09T05:23:07Z

If I have a variable that is a (large) PyArrow Table, is there any way to convert it to a Ray DataFrame with zero copy (or minimal copy)?

devin-petersohn · 2018-04-09T17:09:31Z

Hi @dmadeka, great question! We do not yet support zero copy conversion between Pandas on Ray and PyArrow Tables in the API, but we do have plans to support that in the near future. Are you using PyArrow in a cluster setting? How large is the Table?

robertnishihara · 2018-04-09T17:33:16Z

It may be possible to go from the pyarrow table to a pandas dataframe (using pyarrow) and then from the pandas dataframe to a pandas on Ray dataframe using Ray.

dmadeka · 2018-04-09T18:52:23Z

@devin-petersohn @robertnishihara It's a really, really big table (45B rows plus). The Arrow table is fine, but the conversion from pandas to a Ray dataframe takes forever (since it's done in a loop). I get the sense it should be easy if we already have an Arrow table?

Going from pandas to Pandas on Ray is really slow for large dataframes. Maybe I'll write to CSV and read from there? That seems oddly hacky though, given the Ray internals.

devin-petersohn · 2018-04-10T05:25:05Z

@dmadeka Yes, I can see that it would be slow. We don't expose from_pandas in ray.dataframe because we don't want it to be used.

If you build from current master, there's a hacky way we can do it.

import pandas
import ray
import ray.dataframe as pd

# Assumes a PyArrow Table called pyarrow_table already exists.
column_iter = list(pyarrow_table.itercolumns())

# Put each column into the Ray object store as a single-column pandas DataFrame.
list_of_pandas_columns = [ray.put(pandas.DataFrame(column.to_pandas())) for column in column_iter]

# The last column is used as the index; every other column becomes its own partition.
df = pd.DataFrame(col_partitions=list_of_pandas_columns[:-1],
                  columns=[column.name for column in column_iter[:-1]],
                  index=ray.get(list_of_pandas_columns[-1]).values.flatten())

Ugly code written for clarity (hopefully). This creates a partition for each column, so you'll want to edit the list_of_pandas_columns = ... line if you want more columns in a partition.

Sorry if this is too hacky, we didn't intend on the API getting used this way, but for now this should work (I tested the code above locally and it works).

dmadeka · 2018-04-11T00:05:26Z

That works! But to_pandas is a copy operation :(

devin-petersohn · 2018-04-11T01:46:36Z

There is a parameter you can use for to_pandas called zero_copy_only: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

dmadeka · 2018-04-11T02:20:55Z

True, but it will fail for strings :(. Is there a way to directly refer to the memory?

robertnishihara · 2018-04-11T07:21:48Z

The pyarrow zero-copy facilities are most useful for numerical data and don't really help for numpy object arrays (which is how pandas represents strings under the hood).

dmadeka · 2018-04-11T14:49:07Z

@robertnishihara Totally, my point is that going through pandas is what creates this bottleneck. If I already have the Arrow table, I shouldn't have to worry about that. The problem is that pandas uses Python objects for strings and Arrow does not. I'll wait for the zero copy code! If you want me to try and contribute, a few pointers would help!

@devin-petersohn Thanks! This is good for now!

devin-petersohn · 2018-07-30T22:27:10Z

@dmadeka I am opening an issue to track this one on the Modin repo, which is the new home for Pandas on Ray.

devin-petersohn changed the title ~~Convert PyArrow Table to Ray Dataframe with Zero Copy~~ [DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy Apr 9, 2018

devin-petersohn mentioned this issue Jul 30, 2018

Convert PyArrow Table to Ray Dataframe with Zero Copy modin-project/modin#62

Closed

devin-petersohn closed this as completed Jul 30, 2018

devin-petersohn mentioned this issue Jul 30, 2018

Unable to read non-comma delimited files with ray.dataframe.read_csv #1887

Closed
