Thoughts on DataFrame index not being columns, grouping issues #3275

wesm · 2013-04-08T04:11:37Z

from mailing list

I have a sort of philosophical question about the use of
indexes (especially MultiIndexes) versus just keeping data in
columns.  When using groupby, you tend to get a lot of results
with MultiIndexes, and the indexes are convenient for simple
accessing of items.  However, I've found that index objects lack
key features of ordinary columns.  I often find myself swapping a
particular dimension back and forth from index to column, either
because I need one or the other, or because Pandas gives me one when
I want the other.  What I'm wondering is if I'm using pandas in a
nonidiomatic way, or if there's some way to get around these
difficulties I'm having, or what.

The three main things I've noticed right now to be irritating
about Index objects are:

A) Extracting and using the level values is awkward.  When I have
a column, I can get the values just with df.SomeCol or
df['SomeCol'].  For indexes, I have to do
df.index.get_level_values('IndexLevel'), and even then I just get
another Index instance, which I may have to convert to a series
for other things, because. . .

B) Indexes do not support the convenient operations on
Series, in particular Series.map.  This means that, although I
can easily do df1.ix[df.SomeCol.map(someThingElse)], I cannot do
this when SomeCol is an index instead of just a column in the
data.  I have to extract the index level values as above and then
convert to a series before I can map them.

C) There doesn't appear to be a way to group a DataFrame by a
combination of columns and index levels.  groupby allows a "by"
argument for columns and a "level" argument for index levels, but
using both gives an error.  Even if I could do this, it's not
clear how I would specify the order of the grouping.
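
Points A and B can be made concrete with a small sketch. The frame, the level names and the mapping function below are hypothetical, chosen only to mirror the names used in the post; df[...] with a boolean mask stands in for the df.ix[...] call mentioned above, since .ix was removed from later pandas releases.

```python
import pandas as pd

# Hypothetical data: a two-level MultiIndex plus one ordinary column.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["IndexLevel", "Other"],
)
df = pd.DataFrame(
    {"SomeCol": ["x", "y", "x", "y"], "value": [10, 20, 30, 40]},
    index=index,
)

# Column case: the values are already a Series, so map works directly
# and the result can be used as a boolean row selector.
col_mask = df["SomeCol"].map(lambda v: v == "x")
by_column = df[col_mask]

# Index-level case: get_level_values returns an Index, not a Series.
level_values = df.index.get_level_values("IndexLevel")

# Wrap the raw values in a Series aligned to df.index before mapping.
level_series = pd.Series(level_values.values, index=df.index)
level_mask = level_series.map(lambda v: v == "a")
by_level = df[level_mask]
```

The extra wrapping step in the index-level case is the back-and-forth conversion the post complains about.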

The solutions that come to mind for these problems are: A) give
MultiIndex objects a simple means of accessing the level values
as a Series.  Something like df.index.levels.Level or
df.index.levels['Level'].  Basically make MultiIndexes indexable in
somewhat the same way that DataFrames already are. B) Give
Indexes a map-like operator, and maybe some of the other useful
stuff from Series.  C) Provide some way of grouping using both
columns and index levels.  Maybe some sort of "IndexGroup" class
that would wrap a level name, so you could do groupby(["Column",
IG("IndexLevel"), "OtherColumn"]) to insert an index level in the
grouping order.
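
For readers on later pandas versions, pd.Grouper(level=...) can play a role similar to the proposed IG("IndexLevel") wrapper. This is a sketch under that assumption, not something that was available when the post was written; the frame and its names are hypothetical.

```python
import pandas as pd

# Hypothetical frame: "IndexLevel" lives in the index, "Column" in the data.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["IndexLevel", "Other"],
)
df = pd.DataFrame(
    {"Column": ["x", "y", "x", "x"], "value": [1, 2, 3, 4]},
    index=index,
)

# The position in the list sets the grouping order:
# first by the column, then by the index level.
keys = ["Column", pd.Grouper(level="IndexLevel")]
grouped = df.groupby(keys)["value"].sum()
```

The result is indexed by a MultiIndex whose levels follow the order of the keys list.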

Pandas provides a lot of functionality for slicing and dicing the
data in different ways, but I feel like sometimes I'm forced
to slice it and dice it back and forth by converting indexes to
columns and vice versa instead of being able to directly access
what I want.  I'd be interested to hear how/whether other people
deal with these issues.  Are there ways of doing these things
that I'm missing?


wesm · 2013-04-08T04:33:46Z

At the end of the day, the purpose of indexes is:

Easy value / row / group lookups
Default join behavior
Metadata management in reshape / transpose operations

That the index can be assigned to / replaced is also a feature.
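
A minimal sketch of those three roles, using made-up frames (the names prices, stock and long_form are illustrative only):

```python
import pandas as pd

prices = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.Index(["a", "b", "c"], name="item"),
)
stock = pd.DataFrame(
    {"qty": [5, 7]},
    index=pd.Index(["b", "c"], name="item"),
)

# 1. Lookup: a row/value is addressed by its index label.
price_b = prices.loc["b", "price"]

# 2. Default join behavior: DataFrame.join aligns the two frames on their
#    indexes without naming a key; "a" has no stock row, so qty is NaN there.
joined = prices.join(stock)

# 3. Reshape metadata: unstack moves the "kind" index level into the
#    columns, carrying its labels and its name along.
long_form = pd.Series(
    [1, 2, 3, 4],
    index=pd.MultiIndex.from_product(
        [["a", "b"], ["u", "v"]], names=["item", "kind"]
    ),
)
wide = long_form.unstack("kind")
```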

However, it can also be a nuisance when you're using a DataFrame like an SQL table, e.g. when append(..)-ing DataFrames or things of that nature.

It's a tough call. The dual use of DataFrame as an in-memory database table and a "collection of labeled arrays" has made it hard to be all things to all people. The former use case came later, which is why things aren't quite perfectly designed for the database-type use case. More work to do and probably room for a slightly different table object in pandas.

ghost · 2013-04-08T11:32:41Z

I agree with the spirit of the argument, that there could be more consistency in the API
(#3268, #3070, #413).

Index/MultiIndex does support map, but the result can't easily be turned into the bool indexer
he seems to be after. Trivial to solve in this case I think:
#3280 solves B directly, also A either directly or when used in conjunction with df.index.levels[i]

C is tricky.

edit: #3280 merged into master (.to_series() is the new bit):

In [133]: df=mkdf(5,2,r_idx_nlevels=2)
In [134]: df
Out[134]: 
C0              C_l0_g0 C_l0_g1
R0      R1                     
R_l0_g0 R_l1_g0    R0C0    R0C1
R_l0_g1 R_l1_g1    R1C0    R1C1
R_l0_g2 R_l1_g2    R2C0    R2C1
R_l0_g3 R_l1_g3    R3C0    R3C1
R_l0_g4 R_l1_g4    R4C0    R4C1

In [13]: # so now this is possible
In [135]: df[df.index.to_series().map(lambda x: "l0_g2" in x[0])]
Out[135]: 
C0              C_l0_g0 C_l0_g1
R0      R1                     
R_l0_g2 R_l1_g2    R2C0    R2C1
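
The session above relies on mkdf, a pandas test helper. A self-contained sketch of the same idea, with the frame built by hand to resemble that output (the construction is an assumption, only the to_series().map(...) line is taken from the session):

```python
import pandas as pd

# Rebuild a frame shaped like the mkdf(5, 2, r_idx_nlevels=2) output.
index = pd.MultiIndex.from_tuples(
    [(f"R_l0_g{i}", f"R_l1_g{i}") for i in range(5)],
    names=["R0", "R1"],
)
df = pd.DataFrame(
    {
        "C_l0_g0": [f"R{i}C0" for i in range(5)],
        "C_l0_g1": [f"R{i}C1" for i in range(5)],
    },
    index=index,
)

# to_series() yields a Series of index tuples, indexed by the index itself,
# so the mapped result is a boolean Series aligned with df's rows.
mask = df.index.to_series().map(lambda x: "l0_g2" in x[0])
selected = df[mask]
```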

cpcloud · 2013-04-11T23:42:22Z

+1 I often find myself getting into the underlying array of tuples because of these issues.

alvorithm · 2013-04-26T12:40:35Z

I have trouble with merges, and my solution so far is a very wasteful one: to keep a named index as well as a column copy of it. Things would much improve for merges if, upon a KeyError in resolving a column name, the index names would be used before finally giving up.

It seems that this would greatly enhance the relational usability of Pandas. Would this have any unwanted side-effects?

ghost · 2013-04-26T12:44:41Z

Didn't catch that. code example?

alvorithm · 2013-04-26T13:08:25Z

Ok, that was insufficient because I did not explain the scenario:

I keep track of the foreign keys for each table and have an automated merge mechanism based on column names. When going for merges, the asymmetry between the left|right_index and the on parameters in pandas.merge creates the problem that would be solved by making indexes also addressable by name (if they have one) in a fallback manner. If it is still not clear I can dig up an example.
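
One possible reading of that scenario, as a sketch with hypothetical tables (customers, orders and customer_id are made up here, not taken from the thread):

```python
import pandas as pd

# The key is the named index on one table and a plain column on the other.
customers = pd.DataFrame(
    {"name": ["Ann", "Bob"]},
    index=pd.Index([1, 2], name="customer_id"),
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 2], "amount": [10, 20, 30]}
)

# The asymmetry: the caller has to know which side holds the key as an
# index and spell that out with left_on / right_index.
merged = pd.merge(orders, customers, left_on="customer_id", right_index=True)

# The workaround described above: turn the index into a column so that a
# purely name-based merge works on both sides.
customers_with_col = customers.reset_index()
merged_by_name = pd.merge(orders, customers_with_col, on="customer_id")
```

A name-based fallback to index levels, as requested, would let the second form work without the reset_index() step.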

WillAyd · 2018-07-06T22:22:07Z

Closing due to age and request ambiguity

ghost mentioned this issue Apr 8, 2013

ENH: add to_series() method to Index and subclasses GH3275 #3280

Merged

TomAugspurger mentioned this issue Dec 11, 2013

ENH/API: clarify groupby by to handle columns/index names #5677

Closed

WillAyd modified the milestones: Someday, No action Jul 6, 2018

WillAyd closed this as completed Jul 6, 2018
