-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataFrame] Convert PyArrow Table to Ray Dataframe with Zero Copy #1858
Comments
Hi @dmadeka, great question! We do not yet support zero copy conversion between Pandas on Ray and PyArrow Tables in the API, but we do have plans to support that in the near future. Are you using PyArrow in a cluster setting? How large is the Table? |
It may be possible to go from the pyarrow table to a pandas dataframe (using pyarrow) and then from the pandas dataframe to a pandas on Ray dataframe using Ray. |
@devin-petersohn @robertnishihara Its a really really big table (45B rows plus). The arrow table is fine, but the conversion from pandas to a ray dataframe takes forever (since its done in a loop). I get the sense it should be easy if we're already an Arrow table? Going from pandas to pandas on ray is really slow for large dataframes. Maybe Ill write to csv and read from there? That seems oddly hacky though given the ray internals |
@dmadeka Yes, I can see that it would be slow. We don't have the If you build from current master, there's a hacky way we can do it. import pandas
import ray.dataframe as pd
import ray
import pyarrow as pa
# Have some PyArrow Table called pyarrow_table
column_iter = list(pyarrow_table.itercolumns())
list_of_pandas_columns = [ray.put(pandas.DataFrame(column.to_pandas())) for column in column_iter]
df = pd.DataFrame(col_partitions=list_of_pandas_columns[:-1],
columns=[column.name for column in column_iter[:-1]],
index=ray.get(list_of_pandas_columns[-1]).values.flatten()) Ugly code written for clarity (hopefully). This creates a partition for each column, so you'll want to edit the Sorry if this is too hacky, we didn't intend on the API getting used this way, but for now this should work (I tested the code above locally and it works). |
That works! but to_pandas is a copy operation :( |
There is a parameter you can use for |
True, but it will fail for strings :(. Is there a way to directly refer to the memory? |
The pyarrow zero-copy facilities are most useful for numerical data and don't really help for numpy object arrays (which is how pandas represents strings under the hood). |
@robertnishihara Totally, my point is that having to go through pandas encounters this bottleneck. If I already have the arrow table, I shouldn't have to worry about that. The problem is that pandas using python objects for strings - Arrow does not. Ill wait for the zero copy code! If you want me to try and contribute, a few pointers might be of some help! @devin-petersohn Thanks! This is good for now! |
If I have a variable that is a (large) PyArrow Table - is there anyway to convert this to a Ray DataFrame with zero copy (or minimal copy)?
The text was updated successfully, but these errors were encountered: