Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

Closed
karthikeyann opened this issue May 10, 2019 · 6 comments · Fixed by #3147
Closed
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@karthikeyann
Copy link
Contributor

karthikeyann commented May 10, 2019

Describe the bug
cuDF DataFrame treats first element of each tuple as column name, second element of each tuple as column, while pandas treats each tuple as row, when initializing from list of tuples.

cudf DataFrame behaviour is similar to pandas.DataFrame.from_items API which is deprecated.

Steps/Code to reproduce bug

In [1]: import cudf

In [2]: df = cudf.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])

In [3]: df
Out[3]: <cudf.DataFrame ncols=3 nrows=20 >

In [4]: print(df)
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]

In [5]: import pandas as pd

In [6]: pdf = pd.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])

In [7]: print(pdf)
   0                                                  1
0  a  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1  b  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2  c  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

In [8]: pd.DataFrame.from_items([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/ipython:1: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  #!/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/python
Out[8]:
     a   b   c
0    0   0   0
1    1   1   1
2    2   2   2
3    3   3   3
4    4   4   4
5    5   5   5
6    6   6   6
7    7   7   7
8    8   8   8
9    9   9   9
10  10  10  10
11  11  11  11
12  12  12  12
13  13  13  13
14  14  14  14
15  15  15  15
16  16  16  16
17  17  17  17
18  18  18  18
19  19  19  19

Expected behavior
cudf DataFrame initialization should match with pandas dataframe initialization behavior.

Environment details :

  • Environment location: Bare-metal
  • Method of cuDF install: from source '0.8.0a+709.g58d05cf.dirty'
  • pandas 0.24.2 (installed by conda)

Additional context
Also fix the documentation in dataframe.iloc and related tests.

@karthikeyann karthikeyann added bug Something isn't working Needs Triage Need team to review and classify labels May 10, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 13, 2019
@cwharris cwharris self-assigned this Sep 18, 2019
@cwharris
Copy link
Contributor

cwharris commented Sep 18, 2019

There's a few ways to go about this:

Longest Dev Time + Better Performance
In terms of performance, Does libcudf have a way to transform row to columner format? If so, we could utilize that, rather than doing it at the python level.

Shorter Dev Time + "No Pandas"
We can accomplish the transformation by factoring out the relevant portions of from_pandas, and adding things like automatic column name generation.

Shortest Dev Time + 100% Compatibility
In terms of error handling, we have no guarantee that each tuple will contain the same data type for each position, nor the size of tuple, for that matter. It's possible to leverage Pandas to convert rows to a DataFrame, then use from_pandas - this way we could be sure the incoming data is formatted correctly.

I'd go with the shortest dev time, assuming we just want matching functionality and don't care about "code purity" or performance. That brings up an interesting question about the current implementation of __init__: Would it be reasonable to replace it entirely with a Pandas DataFrame constructor followed by from_pandas... ?

@harrism does libcudf happen to have rows => table conversion already?

@harrism
Copy link
Member

harrism commented Sep 19, 2019

I don't think we have that functionality.

@cwharris
Copy link
Contributor

cwharris commented Sep 20, 2019

@kkraus14 @shwina

of the options listed above, is there any that stands out as particularly apt?

@harrism
Copy link
Member

harrism commented Sep 23, 2019

@cwharris what exactly do you mean by "row to column" format? I assume you don't mean a transpose. There is no concept of a "row" in the traditional database sense in cuDF -- everything is stored in columns. Tables are always made up of columns.

In any case I think the shortest dev time approach is the right first step, since the request here isn't about performance, it's about compatibility. (Also, this is bug originates internally, not from an end user.)

@shwina
Copy link
Contributor

shwina commented Sep 23, 2019

Personally I think that no promise of performance can/should be made for the case of tuple inputs, especially given that the data is assumed to be rows in this case. I think going through Pandas in this case might be the best approach.

@cwharris
Copy link
Contributor

While we can update cudf.DataFrame.__init__ to match Pandas' behavior, the specific example given above won't work in cudf because we do not yet have nested (and therefore list-like) types #2857.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants