Thoughts on DataFrame index not being columns, grouping issues #3275

Closed
wesm opened this issue Apr 8, 2013 · 7 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@wesm
Member

wesm commented Apr 8, 2013

from mailing list

I have a sort of philosophical question about the use of
indexes (especially MultiIndexes) versus just keeping data in
columns.  When using groupby, you tend to get a lot of results
with MultiIndexes, and the indexes are convenient for simple
accessing of items.  However, I've found that index objects lack
key features of ordinary columns.  I often find myself swapping a
particular dimension back and forth from index to column, either
because I need one or the other, or because Pandas gives me one when
I want the other.  What I'm wondering is if I'm using pandas in a
nonidiomatic way, or if there's some way to get around these
difficulties I'm having, or what.

The three main things I've noticed right now to be irritating
about Index objects are:

A) Extracting and using the level values is awkward.  When I have
a column, I can get the values just with df.SomeCol or
df['SomeCol'].  For indexes, I have to do
df.index.get_level_values('IndexLevel'), and even then I just get
another Index instance, which I may have to convert to a series
for other things, because. . .

B) Indexes do not support the convenient operations on
Series, in particular Series.map.  This means that, although I
can easily do df1.ix[df.SomeCol.map(someThingElse)], I cannot do
this when SomeCol is an index instead of just a column in the
data.  I have to extract the index level values as above and then
convert to a series before I can map them.

C) There doesn't appear to be a way to group a DataFrame by a
combination of columns and index levels.  groupby allows a "by"
argument for columns and a "level" argument for index levels, but
using both gives an error.  Even if I could do this, it's not
clear how I would specify the order of the grouping.
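
To make points A and B concrete, here is a minimal sketch of the round trip being described. It is written against current pandas rather than the 2013 release discussed in this thread, and the frame, the level names and the lookup dict are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a two-level index plus one ordinary column.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["IndexLevel", "Sub"],
)
df = pd.DataFrame(
    {"SomeCol": ["x", "y", "x", "y"], "value": [10, 20, 30, 40]},
    index=index,
)

# A column is a single lookup away.
col_values = df["SomeCol"]

# An index level needs get_level_values, which returns an Index, not a Series.
level_index = df.index.get_level_values("IndexLevel")

# Wrap the level values in a Series that shares df's index, then map them.
level_series = pd.Series(level_index.values, index=df.index)
keep = level_series.map({"a": True, "b": False})

# The mapped Series can now be used as a boolean row selector.
selected = df[keep]
```

The extra `pd.Series(...)` step is the conversion complained about in B; with an ordinary column the `map` call could be applied directly.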

The solutions that come to mind for these problems are: A) give
MultiIndex objects a simple means of accessing the level values
as a Series.  Something like df.index.levels.Level or
df.index.levels['Level'].  Basically make MultiIndexes indexable in
somewhat the same way that DataFrames already are. B) Give
Indexes a map-like operator, and maybe some of the other useful
stuff from Series.  C) Provide some way of grouping using both
columns and index levels.  Maybe some sort of "IndexGroup" class
that would wrap a level name, so you could do groupby(["Column",
IG("IndexLevel"), "OtherColumn"]) to insert an index level in the
grouping order.
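
For readers arriving later: the "IndexGroup" idea in C resembles what `pd.Grouper(level=...)` offers in current pandas. The sketch below assumes a recent pandas version (not the one this thread was written against) and uses made-up data; it compares the reset_index round trip with grouping on a column and an index level in one call.

```python
import pandas as pd

# Hypothetical data: "IndexLevel" lives in the index, "Column" in the columns.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["IndexLevel", "Sub"],
)
df = pd.DataFrame(
    {"Column": ["x", "x", "y", "x"], "value": [10, 20, 30, 40]},
    index=index,
)

# Round trip: move the level into the columns first, then group.
via_reset = df.reset_index().groupby(["Column", "IndexLevel"])["value"].sum()

# Direct: pd.Grouper(level=...) wraps a level name inside the `by` list,
# so the position in the list sets the grouping order.
via_grouper = df.groupby(["Column", pd.Grouper(level="IndexLevel")])["value"].sum()
```

Both spellings should give the same grouped sums; the second avoids moving the level into the columns and back.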

Pandas provides a lot of functionality for slicing and dicing the
data in different ways, but I feel like sometimes I'm forced
to slice it and dice it back and forth by converting indexes to
columns and vice versa instead of being able to directly access
what I want.  I'd be interested to hear how/whether other people
deal with these issues.  Are there ways of doing these things
that I'm missing?
@wesm
Member Author

wesm commented Apr 8, 2013

At the end of the day, the purpose of indexes is:

  1. Easy value / row / group lookups
  2. Default join behavior
  3. Metadata management in reshape / transpose operations
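
The three roles listed above can be illustrated with a small sketch. The frames and labels here are invented for the example, and the calls are those of current pandas, so treat it as an illustration rather than a statement about the 2013 API.

```python
import pandas as pd

# Hypothetical frames sharing a named index.
prices = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.Index(["apple", "banana", "cherry"], name="fruit"),
)
stock = pd.DataFrame(
    {"units": [5, 7]},
    index=pd.Index(["banana", "apple"], name="fruit"),
)

# 1. Lookup: a row/value is addressed by its index label.
banana_price = prices.loc["banana", "price"]

# 2. Default join behaviour: DataFrame.join aligns the two frames on their
#    indexes (a left join by default), regardless of row order.
joined = prices.join(stock)

# 3. Metadata in reshaping: after a transpose, the index labels and the
#    index name become the column labels and the columns' name.
transposed = joined.T
```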

That the index can be assigned to / replaced is also a feature.

However, it can also be a nuisance when you're using a DataFrame like an SQL table-- e.g. append(..)-ing DataFrames or things of that nature.

It's a tough call. The dual use of DataFrame as an in-memory database table and a "collection of labeled arrays" has made it hard to be all things to all people. The former use case came later, which is why things aren't quite perfectly designed for the database-type use case. More work to do and probably room for a slightly different table object in pandas.

@ghost

ghost commented Apr 8, 2013

I agree with the spirit of the argument, that there could be more consistency in the API
(#3268, #3070, #413).

Index/MultiIndex does support map, but the result can't easily be turned into the bool indexer
he seems to be after. Trivial to solve in this case I think:
#3275 solves B directly, also A either directly or when used in conjunction with df.index.levels[i]

C is tricky.

edit: #3275 merged into master (.to_series() is the new bit):

# mkdf is assumed to be pandas.util.testing.makeCustomDataframe, a pandas test helper of that era
In [133]: df=mkdf(5,2,r_idx_nlevels=2)
In [134]: df
Out[134]: 
C0              C_l0_g0 C_l0_g1
R0      R1                     
R_l0_g0 R_l1_g0    R0C0    R0C1
R_l0_g1 R_l1_g1    R1C0    R1C1
R_l0_g2 R_l1_g2    R2C0    R2C1
R_l0_g3 R_l1_g3    R3C0    R3C1
R_l0_g4 R_l1_g4    R4C0    R4C1

In [13]: # so now this is possible
In [135]: df[df.index.to_series().map(lambda x: "l0_g2" in x[0])]
Out[135]: 
C0              C_l0_g0 C_l0_g1
R0      R1                     
R_l0_g2 R_l1_g2    R2C0    R2C1
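
Since `mkdf` is a test helper rather than public API, the same pattern can be reproduced with a hand-built frame. This is a sketch under that assumption: the labels copy the session above, but the frame is a simplified stand-in with three rows and unnamed columns.

```python
import pandas as pd

# Simplified stand-in for the mkdf frame: a two-level row index.
index = pd.MultiIndex.from_tuples(
    [("R_l0_g0", "R_l1_g0"), ("R_l0_g1", "R_l1_g1"), ("R_l0_g2", "R_l1_g2")],
    names=["R0", "R1"],
)
df = pd.DataFrame(
    {"C_l0_g0": ["R0C0", "R1C0", "R2C0"], "C_l0_g1": ["R0C1", "R1C1", "R2C1"]},
    index=index,
)

# to_series() returns a Series of index tuples that is itself indexed by
# the MultiIndex, so the mapped result lines up with df's rows.
mask = df.index.to_series().map(lambda key: "l0_g2" in key[0])

# The boolean Series selects the matching rows.
result = df[mask]
```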

@cpcloud
Member

cpcloud commented Apr 11, 2013

+1 I often find myself getting into the underlying array of tuples because of these issues.

@alvorithm

I have trouble with merges, and my solution so far is a very wasteful one: keeping a named index as well as a column copy of it. Merges would improve a lot if, upon a KeyError in resolving a column name, the index names were tried before finally giving up.

It seems that this would greatly enhance the relational usability of Pandas. Would this have any unwanted side-effects?

@ghost

ghost commented Apr 26, 2013

Didn't catch that. code example?

@alvorithm

Ok, that was insufficient because I did not explain the scenario:

I keep track of the foreign keys for each table and have an automated merge mechanism based on column names. When going for merges, the asymmetry between the left|right_index and the on parameters in pandas.merge creates the problem that would be solved by making indexes also addressable by name (if they have one) in a fallback manner. If it is still not clear I can dig up an example.
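
A sketch of the scenario as I understand it, with made-up tables. The second call relies on recent pandas versions letting `on` refer to index level names as well as column names; that behaviour postdates this thread, so treat it as an assumption about the reader's pandas version.

```python
import pandas as pd

# Hypothetical tables: customers keeps its key in a named index,
# orders carries the same key as an ordinary column (a foreign key).
customers = pd.DataFrame(
    {"name": ["Ann", "Bob"]},
    index=pd.Index([1, 2], name="customer_id"),
)
orders = pd.DataFrame({"customer_id": [2, 1, 2], "amount": [5.0, 7.5, 1.0]})

# Asymmetric spelling: a column on one side, the index on the other.
asymmetric = orders.merge(customers, left_on="customer_id", right_index=True)

# Name-based spelling: "customer_id" is resolved as a column in orders and
# as an index level name in customers (assumes a recent pandas version).
by_name = orders.merge(customers, on="customer_id")


def as_rows(frame):
    """Return the merged rows as sorted tuples so results can be compared."""
    cols = ["customer_id", "amount", "name"]
    return sorted(frame[cols].itertuples(index=False, name=None))
```

An automated merge keyed purely on names only needs the second form, which is the fallback behaviour requested here.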

@WillAyd WillAyd modified the milestones: Someday, No action Jul 6, 2018
@WillAyd
Member

WillAyd commented Jul 6, 2018

Closing due to age and request ambiguity

@WillAyd WillAyd closed this as completed Jul 6, 2018