Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy #1858

Closed
dmadeka opened this issue Apr 9, 2018 · 10 comments
Closed

[DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy #1858

dmadeka opened this issue Apr 9, 2018 · 10 comments

Comments

@dmadeka
Copy link

dmadeka commented Apr 9, 2018

If I have a variable that is a (large) PyArrow Table - is there anyway to convert this to a Ray DataFrame with zero copy (or minimal copy)?

@devin-petersohn
Copy link
Member

Hi @dmadeka, great question! We do not yet support zero copy conversion between Pandas on Ray and PyArrow Tables in the API, but we do have plans to support that in the near future. Are you using PyArrow in a cluster setting? How large is the Table?

@devin-petersohn devin-petersohn changed the title Convert PyArrow Table to Ray Dataframe with Zero Copy [DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy Apr 9, 2018
@robertnishihara
Copy link
Collaborator

It may be possible to go from the pyarrow table to a pandas dataframe (using pyarrow) and then from the pandas dataframe to a pandas on Ray dataframe using Ray.

@dmadeka
Copy link
Author

dmadeka commented Apr 9, 2018

@devin-petersohn @robertnishihara Its a really really big table (45B rows plus). The arrow table is fine, but the conversion from pandas to a ray dataframe takes forever (since its done in a loop). I get the sense it should be easy if we're already an Arrow table?

Going from pandas to pandas on ray is really slow for large dataframes. Maybe Ill write to csv and read from there? That seems oddly hacky though given the ray internals

@devin-petersohn
Copy link
Member

@dmadeka Yes, I can see that it would be slow. We don't have the from_pandas exposed from the ray.dataframe because we don't want it to be used.

If you build from current master, there's a hacky way we can do it.

import pandas
import ray.dataframe as pd
import ray
import pyarrow as pa

# Have some PyArrow Table called pyarrow_table
column_iter = list(pyarrow_table.itercolumns())

list_of_pandas_columns = [ray.put(pandas.DataFrame(column.to_pandas())) for column in column_iter]

df = pd.DataFrame(col_partitions=list_of_pandas_columns[:-1],
                  columns=[column.name for column in column_iter[:-1]],
                  index=ray.get(list_of_pandas_columns[-1]).values.flatten())

Ugly code written for clarity (hopefully). This creates a partition for each column, so you'll want to edit the list_of_pandas_columns = ... line if you want more columns in a partition.

Sorry if this is too hacky, we didn't intend on the API getting used this way, but for now this should work (I tested the code above locally and it works).

@dmadeka
Copy link
Author

dmadeka commented Apr 11, 2018

That works! but to_pandas is a copy operation :(

@devin-petersohn
Copy link
Member

There is a parameter you can use for to_pandas called zero_copy_only: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

@dmadeka
Copy link
Author

dmadeka commented Apr 11, 2018

True, but it will fail for strings :(. Is there a way to directly refer to the memory?

@robertnishihara
Copy link
Collaborator

The pyarrow zero-copy facilities are most useful for numerical data and don't really help for numpy object arrays (which is how pandas represents strings under the hood).

@dmadeka
Copy link
Author

dmadeka commented Apr 11, 2018

@robertnishihara Totally, my point is that having to go through pandas encounters this bottleneck. If I already have the arrow table, I shouldn't have to worry about that. The problem is that pandas using python objects for strings - Arrow does not. Ill wait for the zero copy code! If you want me to try and contribute, a few pointers might be of some help!

@devin-petersohn Thanks! This is good for now!

@devin-petersohn
Copy link
Member

@dmadeka I am opening an issue to track this one on the Modin repo, which is the new home for Pandas on Ray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants