BUG: Passing multiple levels to stack when having mixed integer/string level names #8584

Closed
jorisvandenbossche opened this issue Oct 19, 2014 · 5 comments · Fixed by #8809
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@jorisvandenbossche
Member

Related #7770

Using the example from the docs (http://pandas.pydata.org/pandas-docs/stable/reshaping.html#multiple-levels):

from numpy.random import randn
from pandas import DataFrame, MultiIndex

columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
                                 names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)

CONTEXT: df.stack(level=['animal', 'hair_length']) and df.stack(level=[1, 2]) are equivalent (feature introduced in #7770). Mixing integer locations and string names (e.g. df.stack(level=['animal', 2])) raises a ValueError.

But if the level names themselves are of mixed types, several different (and wrong) things happen:

  • With a number that cannot be mistaken for a level position (here 10), it still works as it should:

    df.columns.names = ['exp', 'animal', 10]
    df.stack(level=['animal', 10])
    
  • With the number 1, it treats the 1 as a level number instead of a level name, leading to a wrong result (the same level is stacked twice):

    In [42]: df.columns.names = ['exp', 'animal', 1]
    
    In [43]: df.stack(level=['animal', 1])
    Out[43]: 
    exp                     A         B
      animal animal                    
    0 cat    cat    -1.006065  0.401136
      dog    dog     0.526734 -1.753478
    1 cat    cat    -0.718401 -0.400386
      dog    dog    -0.951336 -1.074323
    2 cat    cat     1.119843 -0.606982
      dog    dog     0.371467 -1.837341
    3 cat    cat    -1.467968  1.114524
      dog    dog    -0.040112  0.240026
    
  • With the number 0, it raises a strange error:

    In [46]: df.columns.names = ['exp', 'animal', 0]
    
    In [47]: df.stack(level=['animal', 0])
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-47-4e9507e0708f> in <module>()
    ----> 1 df.stack(level=['animal', 0])
    
    /home/joris/scipy/pandas/pandas/core/frame.pyc in stack(self, level, dropna)
    3390 
    3391         if isinstance(level, (tuple, list)):
    -> 3392             return stack_multiple(self, level, dropna=dropna)
    3393         else:
    3394             return stack(self, level, dropna=dropna)
    
    ....
    
    /home/joris/scipy/pandas/pandas/core/index.pyc in _partial_tup_index(self, tup, side)
    3820             raise KeyError('Key length (%d) was greater than MultiIndex'
    3821                            ' lexsort depth (%d)' %
    -> 3822                            (len(tup), self.lexsort_depth))
    3823 
    3824         n = len(tup)
    
    KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (0)'
    
@jreback
Contributor

jreback commented Oct 19, 2014

Hmm, so this is an API issue then? I think we should be very strict here, as we cannot easily disambiguate (cf. the .ix/.loc issues).

  • Integers must always be treated as positional, and mixing integers and names is not allowed
  • Integer-like level names can then be passed as strings (not as actual integers; not sure whether this will break anything)
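
A minimal sketch of what this strict rule could look like, written against a plain list of level names rather than a real MultiIndex. The function name `resolve_levels_strict` is hypothetical and is not part of pandas; it only illustrates the proposal.

```python
def resolve_levels_strict(names, level):
    """Hypothetical helper: integers are always positions, anything else is a name."""
    is_int = [isinstance(lev, int) for lev in level]
    # Mixing integers and names is rejected outright.
    if any(is_int) and not all(is_int):
        raise ValueError("level must be all integers or all names, not a mix")
    if all(is_int):
        # Integers are never looked up among the names.
        for lev in level:
            if not 0 <= lev < len(names):
                raise IndexError("level position %d out of range" % lev)
        return list(level)
    # Names only; list.index raises ValueError for an unknown name.
    return [names.index(lev) for lev in level]


# An integer-like name has to be stored and passed as a string:
print(resolve_levels_strict(['exp', 'animal', '1'], ['animal', '1']))  # [1, 2]
```

The cost of this rule is that a level whose name is the actual integer 1 can no longer be selected by name at all, only by position.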

@jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 19, 2014
@jreback added this to the 0.15.1 milestone Oct 19, 2014
@jorisvandenbossche
Member Author

I think these are examples that we can disambiguate.

I understood from that PR that the new logic was:

  • if all entries (strings or ints) are found in the level names -> use as level names
  • if not all found:
    • if all integers -> use as level locations
    • if not all integers -> raise ValueError

So following that, these cases should (and can) work, I think. They already do work in some cases, so at the very least the behaviour is a bit inconsistent.

And if that logic is correct (i.e. it is the logic we want to follow), it should probably also be mentioned in the docstring.
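
The resolution order described above can be sketched in a few lines of plain Python. This is only an illustration on a list of level names; `resolve_levels` is a hypothetical name, not the actual pandas implementation.

```python
def resolve_levels(names, level):
    """Hypothetical sketch: map the requested levels to level positions."""
    # 1. All entries (strings or ints) are level names -> use them as names.
    if all(lev in names for lev in level):
        return [names.index(lev) for lev in level]
    # 2. Not all found, but all integers -> use them as level locations.
    if all(isinstance(lev, int) for lev in level):
        return list(level)
    # 3. Otherwise the request is ambiguous.
    raise ValueError("level should contain all level names or all level numbers")


# Integer level names win over positions when every entry is a known name:
print(resolve_levels(['exp', 'animal', 1], ['animal', 1]))  # [1, 2]
print(resolve_levels(['exp', 'animal', 0], ['animal', 0]))  # [1, 2]
```

Under this order all three cases from the report (level names 10, 1 and 0) resolve to positions 1 and 2, which is the result one would expect.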

@onesandzeroes
Contributor

I agree that these two should work, since all the levels are in the level names:

In [42]: df.columns.names = ['exp', 'animal', 1]
In [43]: df.stack(level=['animal', 1])

And

In [46]: df.columns.names = ['exp', 'animal', 0]
In [47]: df.stack(level=['animal', 0])

I had a look tonight and I think I have a fix for both cases; we just need to be a bit more careful about when we're dealing with level names and when we're dealing with level numbers. If I can get these cases working, then I think the logic you've outlined (which was the original intent of the PR) still holds. Probably a good idea to add it to the docstring though.

@onesandzeroes
Contributor

The simplest solution I came up with for this involved adding an as_level_numbers=False flag to MultiIndex.swaplevel(), so I could use as_level_numbers=True to signal that the levels being passed were already level numbers, skipping the _get_level_number() step.

Would this be OK to add to the API, or should I add this behaviour in a new method like MultiIndex._swaplevel_using_level_numbers()? Seems like it could be somewhat useful if you ever need to force swaplevel to deal with the passed levels as numbers, but it might break consistency.
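
To illustrate why skipping the lookup step matters, here is a rough sketch on a plain list of level names. Both `get_level_number` (a simplified stand-in for a name-first lookup) and `swap_names` are hypothetical helpers for this example; only the `as_level_numbers` flag name comes from the proposal above.

```python
def get_level_number(names, level):
    """Simplified name-first lookup: try the name, then fall back to a position."""
    if level in names:
        return names.index(level)
    if isinstance(level, int) and 0 <= level < len(names):
        return level
    raise KeyError(level)


def swap_names(names, i, j, as_level_numbers=False):
    """Return a copy of names with levels i and j swapped."""
    if not as_level_numbers:
        # Default: i and j may be names, so resolve them first.
        i, j = get_level_number(names, i), get_level_number(names, j)
    out = list(names)
    out[i], out[j] = out[j], out[i]
    return out


names = ['exp', 'animal', 1]
# Positions 0 and 1 were meant, but 1 is re-read as the level *name* 1:
print(swap_names(names, 0, 1))                         # [1, 'animal', 'exp']
# With the flag, already-resolved numbers are used as they are:
print(swap_names(names, 0, 1, as_level_numbers=True))  # ['animal', 'exp', 1]
```

The point of the sketch is that resolving a level twice is not safe once integer names exist: a position that has already been looked up can be mistaken for a name on the second pass.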

@jreback
Contributor

jreback commented Oct 20, 2014

@onesandzeroes you can make an internal function (leading '_') if you need to, but this shouldn't be exposed
