[FEATURE-REQUEST] Create pyarrow structs via vaex #2032

Open
Ben-Epstein opened this issue Apr 26, 2022 · 9 comments

Comments

@Ben-Epstein
Contributor

Description
Since vaex provides all these great struct operations, it would be great if we could create structs directly in vaex from massive dataframes, without loading the columns into memory.

Additional context

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = pa.StructArray.from_arrays(
    arrays=[df.x.values, df.y.values, df.z.values], names=["x", "y", "z"]
)
df

Now we can use structs, but `.values` brought every column into memory.

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = pa.StructArray.from_arrays(
    arrays=[df.x, df.y, df.z], names=["x", "y", "z"]
)
df

Passing the expressions directly like this would be great, but it fails.

Even better would be a helper function, something like

import pyarrow as pa
import vaex


df = vaex.example()

df["xyz"] = df.func.create_arrow_struct(df.x, df.y, df.z)
df

or something similar.

@maartenbreddels
Member

maartenbreddels commented Apr 26, 2022

What do you think of this:

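import pyarrow as pa
import vaex


@vaex.register_function()
def create_arrow_struct(**kwargs):
    # Field names come from the keyword names, child arrays from the values.
    return pa.StructArray.from_arrays(list(kwargs.values()), names=list(kwargs.keys()))

df = vaex.datasets.titanic()
df.func.create_arrow_struct(name=df['name'], age=df['age'])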

@Ben-Epstein
Contributor Author

Ben-Epstein commented Apr 29, 2022

That's great!

But @maartenbreddels it doesn't work if you try to listAgg that struct column. Maybe that's a new issue, not sure.

@maartenbreddels
Member

Yeah, we can only do that on primitives and strings. Maybe we can split the struct, and merge it back again automatically.

@maartenbreddels
Member

@JovanVeljanoski any opinions on this? How should we attach this, or do you like my code proposal?

@maartenbreddels
Member

This is the opposite of #2072, so once we merge that we should take another look at this.

@JovanVeljanoski
Member

> @JovanVeljanoski any opinions on this? How should we attach this, or do you like my code proposal?

Still thinking about it. I want to do some tests, but I'm busy... :S

@maartenbreddels
Member

I think this would be nice:

df = vaex.from_scalars(user_name="Maarten", user_surname="Breddels")
df = df.struct.merge(join_char="_")  # this will automatically collect all user_* columns into a struct column named user

and

df = vaex.datasets.titanic()
df = df.struct.merge({'person': ['name', 'age']})  # will create a person struct column based on name and age
# or..
df = df.struct.merge({'Person': {'name': 'Name', 'age': 'Age'}})  # use a dict to rename?
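To make the `join_char` idea concrete, here is a rough pure-Python sketch of how column names might be grouped by prefix before any struct is built. `group_by_prefix` is a hypothetical helper, not part of vaex, and the rule that a prefix needs at least two columns to become a struct is my own assumption.

```python
from collections import defaultdict


def group_by_prefix(column_names, join_char="_"):
    """Map each shared prefix to {field_name: column_name} (hypothetical helper)."""
    groups = defaultdict(dict)
    for column in column_names:
        # Split on the first join_char only: "user_name" -> ("user", "_", "name").
        prefix, sep, field = column.partition(join_char)
        if sep and prefix and field:
            groups[prefix][field] = column
    # Assumption: a prefix becomes a struct only if two or more columns share it.
    return {prefix: fields for prefix, fields in groups.items() if len(fields) > 1}


print(group_by_prefix(["user_name", "user_surname", "age"]))
# prints {'user': {'name': 'user_name', 'surname': 'user_surname'}}
```

The resulting mapping has the same shape as the explicit dict form of `merge`, so both call styles could feed the same struct-building step.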

@JovanVeljanoski
Member

I like the proposal of @maartenbreddels above. The one correction/suggestion I would make is this:

df['person'] = df.struct.merge(['name', 'age'])

df['person'] = df.struct.merge({'name':'Name', 'age':'Age'})

Although I have to say I do not know if merge is the right method name here. I would naively expect most methods in the struct namespace to operate on structs rather than create them, so something like struct.create or struct.from_expressions might be more explicit?

@maartenbreddels
Member

Yes, since you can imagine `df.struct` doing a type check, it also feels odd to me. But this does organize all the methods.

Can you start by writing a test? We can do a last-minute name change anyway.


3 participants