[FEA] join two cudf data frames with list columns #5621

Closed
rnyak opened this issue Jul 1, 2020 · 4 comments · Fixed by #5771
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@rnyak
Contributor

rnyak commented Jul 1, 2020

Is your feature request related to a problem? Please describe.

Joins are used often in the preprocessing stages of ML and DL models. For recommender systems, one use case is joining two data frames that contain list and/or nested-list columns. cuDF currently does not support this.

Describe the solution you'd like

I'd like to be able to join two cudf data frames with list and/or nested list columns.

Additional context
pandas supports this operation (see the example below).

import pandas as pd

nested = [[['1', '2', '3'], ['1', '2']], [['2', '3', '1']], [['3', '4']], [['1']]]
doc_id = [1, 2, 3, 4]
df = pd.DataFrame({'doc_id': doc_id, 'col1': nested})

nested = [['1', '2', '3'], ['2', '3', '2'], ['3', '4', '5'], ['1'], ['2']]
ad_id = [1, 2, 3, 4, 5]
df2 = pd.DataFrame({'ad_id': ad_id, 'col2': nested})

df_merged = df.merge(df2, how='left', left_on='doc_id', right_on='ad_id')
df_merged
   doc_id                 col1  ad_id       col2
0       1  [[1, 2, 3], [1, 2]]      1  [1, 2, 3]
1       2          [[2, 3, 1]]      2  [2, 3, 2]
2       3             [[3, 4]]      3  [3, 4, 5]
3       4                [[1]]      4        [1]
@rnyak rnyak added Needs Triage Need team to review and classify feature request New feature or request labels Jul 1, 2020
@jrhemstad
Contributor

To be clear, this isn't asking to join on list columns, but rather list columns come along for the ride in the dataframe.

I think this should largely "just work" once #5073 is done.
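
To illustrate the "come along for the ride" point: a join only needs to compare the key columns, and the payload columns (list-typed or not) are then picked up row by row. Below is a minimal pure-Python sketch of that idea using the data from the issue. The helper names `left_join_gather_maps` and `gather` are hypothetical and chosen for illustration; this is not how libcudf or cuDF implement joins.

```python
def left_join_gather_maps(left_keys, right_keys):
    """Return (left_idx, right_idx) row-position maps for a left join.

    right_idx holds None wherever a left key has no match on the right.
    """
    # Map each right-side key to the row positions where it occurs.
    positions = {}
    for j, key in enumerate(right_keys):
        positions.setdefault(key, []).append(j)

    left_idx, right_idx = [], []
    for i, key in enumerate(left_keys):
        # An unmatched left row is still emitted once, paired with None.
        for j in positions.get(key, [None]):
            left_idx.append(i)
            right_idx.append(j)
    return left_idx, right_idx


def gather(column, index_map):
    """Pick rows of any column (scalars or lists) by position; None becomes a null."""
    return [None if i is None else column[i] for i in index_map]


doc_id = [1, 2, 3, 4]
col1 = [[['1', '2', '3'], ['1', '2']], [['2', '3', '1']], [['3', '4']], [['1']]]
ad_id = [1, 2, 3, 4, 5]
col2 = [['1', '2', '3'], ['2', '3', '2'], ['3', '4', '5'], ['1'], ['2']]

# Only the integer key columns take part in the matching step.
left_idx, right_idx = left_join_gather_maps(doc_id, ad_id)

# The list columns are never compared; they are simply gathered.
merged = {
    'doc_id': gather(doc_id, left_idx),
    'col1': gather(col1, left_idx),
    'ad_id': gather(ad_id, right_idx),
    'col2': gather(col2, right_idx),
}
```

Under this view, supporting list columns in a join reduces to supporting a gather on list columns, since the matching step never looks at them.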

@rnyak
Contributor Author

rnyak commented Jul 1, 2020

@jrhemstad yes, agreed.

@rnyak
Contributor Author

rnyak commented Jul 21, 2020

> To be clear, this isn't asking to join on list columns, but rather list columns come along for the ride in the dataframe.
>
> I think this should largely "just work" once #5073 is done.

@jrhemstad I see that #5073 was merged. Does that mean we will have a Python API for cuDF list-type columns soon? Thanks.

@shwina
Contributor

shwina commented Jul 22, 2020

@rnyak I'm working on bindings for gather/slice for list types in Python. It's currently waiting on a fix for #5717.
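
For readers unfamiliar with what gather and slice mean for a list column, here is a rough pure-Python sketch. It assumes an Arrow-style layout in which a list column is stored as an `offsets` array plus a flat `child` array of values; the function names are hypothetical and this is not the libcudf implementation or the Python bindings mentioned above.

```python
def gather_list_column(offsets, child, index_map):
    """Build a new (offsets, child) pair containing the rows named in index_map."""
    new_offsets = [0]
    new_child = []
    for i in index_map:
        start, end = offsets[i], offsets[i + 1]
        new_child.extend(child[start:end])   # copy the elements of row i
        new_offsets.append(len(new_child))   # record where the next row starts
    return new_offsets, new_child


def slice_list_column(offsets, child, begin, end):
    """Return rows [begin, end) as a new (offsets, child) pair."""
    base = offsets[begin]
    # Rebase the offsets so the sliced column starts at zero.
    new_offsets = [o - base for o in offsets[begin:end + 1]]
    new_child = child[offsets[begin]:offsets[end]]
    return new_offsets, new_child


def to_pylist(offsets, child):
    """Expand (offsets, child) back into a Python list of lists."""
    return [child[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# col2 from the issue: [['1','2','3'], ['2','3','2'], ['3','4','5'], ['1'], ['2']]
offsets = [0, 3, 6, 9, 10, 11]
child = ['1', '2', '3', '2', '3', '2', '3', '4', '5', '1', '2']

gathered = to_pylist(*gather_list_column(offsets, child, [3, 0]))
sliced = to_pylist(*slice_list_column(offsets, child, 1, 3))
```

Here `gathered` holds rows 3 and 0 of the column, in that order, and `sliced` holds rows 1 and 2.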

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Aug 5, 2020
4 participants