[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

karthikeyann · 2019-05-10T12:02:41Z

Describe the bug
cuDF DataFrame treats first element of each tuple as column name, second element of each tuple as column, while pandas treats each tuple as row, when initializing from list of tuples.

cudf DataFrame behaviour is similar to pandas.DataFrame.from_items API which is deprecated.

Steps/Code to reproduce bug

In [1]: import cudf

In [2]: df = cudf.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])

In [3]: df
Out[3]: <cudf.DataFrame ncols=3 nrows=20 >

In [4]: print(df)
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]

In [5]: import pandas as pd

In [6]: pdf = pd.DataFrame([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])

In [7]: print(pdf)
   0                                                  1
0  a  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1  b  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2  c  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

In [8]: pd.DataFrame.from_items([('a', list(range(20))), ('b', list(range(20))), ('c', list(range(20)))])
/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/ipython:1: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  #!/home/nfs/knataraj/miniconda3/envs/cudf_dev/bin/python
Out[8]:
     a   b   c
0    0   0   0
1    1   1   1
2    2   2   2
3    3   3   3
4    4   4   4
5    5   5   5
6    6   6   6
7    7   7   7
8    8   8   8
9    9   9   9
10  10  10  10
11  11  11  11
12  12  12  12
13  13  13  13
14  14  14  14
15  15  15  15
16  16  16  16
17  17  17  17
18  18  18  18
19  19  19  19

Expected behavior
cudf DataFrame initialization should match with pandas dataframe initialization behavior.

Environment details :

Environment location: Bare-metal
Method of cuDF install: from source '0.8.0a+709.g58d05cf.dirty'
pandas 0.24.2 (installed by conda)

Additional context
Also fix the documentation in dataframe.iloc and related tests.

The text was updated successfully, but these errors were encountered:

cwharris · 2019-09-18T19:35:45Z

There's a few ways to go about this:

Longest Dev Time + Better Performance
In terms of performance, Does libcudf have a way to transform row to columner format? If so, we could utilize that, rather than doing it at the python level.

Shorter Dev Time + "No Pandas"
We can accomplish the transformation by factoring out the relevant portions of from_pandas, and adding things like automatic column name generation.

Shortest Dev Time + 100% Compatibility
In terms of error handling, we have no guarantee that each tuple will contain the same data type for each position, nor the size of tuple, for that matter. It's possible to leverage Pandas to convert rows to a DataFrame, then use from_pandas - this way we could be sure the incoming data is formatted correctly.

I'd go with the shortest dev time, assuming we just want matching functionality and don't care about "code purity" or performance. That brings up an interesting question about the current implementation of __init__: Would it be reasonable to replace it entirely with a Pandas DataFrame constructor followed by from_pandas... ?

@harrism does libcudf happen to have rows => table conversion already?

harrism · 2019-09-19T03:56:31Z

I don't think we have that functionality.

cwharris · 2019-09-20T22:02:53Z

@kkraus14 @shwina

of the options listed above, is there any that stands out as particularly apt?

harrism · 2019-09-23T03:28:26Z

@cwharris what exactly do you mean by "row to column" format? I assume you don't mean a transpose. There is no concept of a "row" in the traditional database sense in cuDF -- everything is stored in columns. Tables are always made up of columns.

In any case I think the shortest dev time approach is the right first step, since the request here isn't about performance, it's about compatibility. (Also, this is bug originates internally, not from an end user.)

shwina · 2019-09-23T13:46:29Z

Personally I think that no promise of performance can/should be made for the case of tuple inputs, especially given that the data is assumed to be rows in this case. I think going through Pandas in this case might be the best approach.

cwharris · 2019-10-17T19:36:23Z

While we can update cudf.DataFrame.__init__ to match Pandas' behavior, the specific example given above won't work in cudf because we do not yet have nested (and therefore list-like) types #2857.

karthikeyann added bug Something isn't working Needs Triage Need team to review and classify labels May 10, 2019

kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 13, 2019

cwharris self-assigned this Sep 18, 2019

cwharris mentioned this issue Oct 17, 2019

[REVIEW] init Dataframe from rows via list of tuples #3147

Merged

kkraus14 closed this as completed in #3147 Oct 23, 2019

cwharris mentioned this issue Nov 27, 2019

[BUG] df tuple constructor regression #3483

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

karthikeyann commented May 10, 2019 •

edited

Loading

cwharris commented Sep 18, 2019 •

edited

Loading

harrism commented Sep 19, 2019

cwharris commented Sep 20, 2019 •

edited

Loading

harrism commented Sep 23, 2019

shwina commented Sep 23, 2019

cwharris commented Oct 17, 2019

[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

[BUG] inconsistent behaviour of cudf.DataFrame and pandas.DataFrame from list of tuples #1705

Comments

karthikeyann commented May 10, 2019 • edited Loading

cwharris commented Sep 18, 2019 • edited Loading

harrism commented Sep 19, 2019

cwharris commented Sep 20, 2019 • edited Loading

harrism commented Sep 23, 2019

shwina commented Sep 23, 2019

cwharris commented Oct 17, 2019

karthikeyann commented May 10, 2019 •

edited

Loading

cwharris commented Sep 18, 2019 •

edited

Loading

cwharris commented Sep 20, 2019 •

edited

Loading