[BUG] DataFrame.stack with MultiIndex columns does not preserve order #15104

Closed
amanlai opened this issue Feb 21, 2024 · 3 comments · Fixed by #15100

Comments

amanlai commented Feb 21, 2024

Describe the bug

Calling .stack() to stack a level of a DataFrame with MultiIndex columns returns a DataFrame whose columns are sorted lexicographically instead of keeping their original order.

The following constructor:

df = cudf.DataFrame({('b', 3): [1, 2], ('a', 2): [3, 4]})

creates a DataFrame that looks like

   b  a
   3  2
0  1  3
1  2  4

Stacking it using df.stack() produces

       a    b
0 2  NaN  3.0
  3  1.0  NaN
1 2  NaN  4.0
  3  2.0  NaN

Since the original column order was 'b' and 'a', the expected output is

       b    a
0 2  NaN  3.0
  3  1.0  NaN
1 2  NaN  4.0
  3  2.0  NaN
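
To make the expected semantics concrete, here is a minimal pure-Python sketch of an order-preserving stack. It is not how cuDF or pandas implement stack(); the helper name stack_inner_level and the use of None for missing cells are assumptions made for illustration only. The inner labels are sorted here simply to match the row order shown in the expected output above.

```python
def stack_inner_level(columns):
    """Move the inner level of two-level column keys into the row index.

    `columns` maps (outer, inner) tuples to equal-length lists of values.
    Returns (outer_labels, rows), where rows maps (row_number, inner) to a
    list of values aligned with outer_labels. Missing cells are None.
    """
    # dict.fromkeys keeps first-appearance order, unlike sorted().
    outer_labels = list(dict.fromkeys(outer for outer, _ in columns))
    # Sorted only to mirror the row order in the expected output.
    inner_labels = sorted({inner for _, inner in columns})
    n_rows = len(next(iter(columns.values())))

    rows = {}
    for i in range(n_rows):
        for inner in inner_labels:
            rows[(i, inner)] = [
                columns[(outer, inner)][i] if (outer, inner) in columns else None
                for outer in outer_labels
            ]
    return outer_labels, rows


data = {("b", 3): [1, 2], ("a", 2): [3, 4]}
labels, rows = stack_inner_level(data)
print(labels)        # ['b', 'a']
print(rows[(0, 2)])  # [None, 3]
print(rows[(0, 3)])  # [1, None]
```

Swapping dict.fromkeys for sorted() when building outer_labels would give the lexicographic column order reported in this issue.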

This is reproducible on Google Colab using the Rapids installer.

On a related note, I believe stack() in pandas behaves like this in older versions (e.g. 1.5) but preserves the original order in newer versions (e.g. 2.2).

amanlai added the bug label Feb 21, 2024

shwina commented Feb 21, 2024

Thanks for reporting, @amanlai! cc: @galipremsagar any chance this is fixed in the upcoming 2.2 compatibility changes?

shwina added the Python label Feb 21, 2024

galipremsagar commented

Yes, it is fixed in 24.04:

In [1]: import cudf

In [2]: df = cudf.DataFrame({('b', 3): [1, 2], ('a', 2): [3, 4]})

In [3]: df
Out[3]: 
   b  a
   3  2
0  1  3
1  2  4

In [4]: df.stack()
/nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/dataframe.py:6848: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of cudf. Specify future_stack=True to adopt the new implementation and silence this warning.
  warnings.warn(
Out[4]: 
        b     a
0 2  <NA>     3
  3     1  <NA>
1 2  <NA>     4
  3     2  <NA>

In [5]: df.to_pandas().stack()
<ipython-input-5-2234f1c7afa9>:1: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning.
  df.to_pandas().stack()
Out[5]: 
       b    a
0 2  NaN  3.0
  3  1.0  NaN
1 2  NaN  4.0
  3  2.0  NaN
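
For anyone on a release without the fix, one possible workaround is to reselect the stacked columns in their original order. The sketch below uses pandas as a stand-in so it runs without a GPU; that the same indexing works unchanged on a cudf.DataFrame is an assumption, not something verified in this thread. The names original_order and restored are illustrative.

```python
import warnings

import pandas as pd

df = pd.DataFrame({("b", 3): [1, 2], ("a", 2): [3, 4]})

with warnings.catch_warnings():
    # pandas 2.1+ may emit a FutureWarning about the legacy implementation.
    warnings.simplefilter("ignore", FutureWarning)
    stacked = df.stack()

# Outer-level labels in order of first appearance ('b' before 'a').
original_order = list(dict.fromkeys(df.columns.get_level_values(0)))

# Reselecting by label restores the original order whether or not
# stack() sorted the columns.
restored = stacked[original_order]
print(list(restored.columns))  # ['b', 'a']
```

The reselection is a no-op on versions that already preserve order, so it should be safe to leave in place across upgrades.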

galipremsagar linked a pull request Feb 21, 2024 that will close this issue

shwina commented Feb 21, 2024

Fantastic! @amanlai our nightly packages should contain these fixes if you're interested in trying them out! I'm closing this issue for now, but please feel free to reopen if you require further help!

shwina closed this as completed Feb 21, 2024
bdice changed the title from "[BUG]" to "[BUG] DataFrame.stack with MultiIndex columns does not preserve order" Feb 21, 2024