groupby with index = False returns NANs when column is categorical. #13204

Closed
toasteez opened this issue May 17, 2016 · 4 comments
toasteez commented May 17, 2016

Please see Stack Overflow for an example of the issue:

http://stackoverflow.com/questions/37279260/why-doesnt-pandas-allow-a-categorical-column-to-be-used-in-groupby?noredirect=1#comment62084780_37279260

>>> pd.__version__
'0.18.1'
>>> 

# import the pandas module
import pandas as pd

# Create an example dataframe
raw_data = {'Date': ['2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13','2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13', '2016-05-13'],
    'Portfolio': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Duration': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'Yield': [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],}

df = pd.DataFrame(raw_data, columns = ['Date', 'Portfolio', 'Duration', 'Yield'])

df['Portfolio'] = pd.Categorical(df['Portfolio'],['C', 'B', 'A'])
df = df.sort_values('Portfolio')

dfs = df.groupby(['Date', 'Portfolio'], as_index=False).sum()

print(dfs)

                        Date    Portfolio   Duration   Yield
Date        Portfolio               
13/05/2016  C           NaN     NaN         NaN        NaN
            B           NaN     NaN         NaN        NaN
            A           NaN     NaN         NaN        NaN
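For reference, a minimal workaround sketch (not part of the original report, and using a smaller frame of the same shape): grouping with the default as_index=True and then flattening the result with reset_index() sidesteps the NaNs:

```python
import pandas as pd

# Smaller frame with the same shape as the report above
df = pd.DataFrame({
    'Date': ['2016-05-13'] * 6,
    'Portfolio': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Duration': [1, 1, 2, 2, 3, 3],
    'Yield': [0.3, 0.3, 2.0, 2.0, 1.0, 1.0],
})
df['Portfolio'] = pd.Categorical(df['Portfolio'], ['C', 'B', 'A'])

# Group with the default as_index=True, then flatten the MultiIndex
dfs = df.groupby(['Date', 'Portfolio']).sum().reset_index()
print(dfs)
```

The rows come back in the categorical order C, B, A because groupby sorts group keys, and categoricals sort by category order.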

jreback commented May 17, 2016

Please post an example & show_versions. SO links are nice, but an in-line example is much better.


dsm054 commented May 17, 2016

FWIW my example would be something like

>>> pd.__version__
'0.18.1'
>>> 
>>> df = pd.DataFrame({"A": [1,1,1], "B": [2,2,2], "C": pd.Categorical([1,2,3])})
>>> df.groupby(["A","C"]).sum().reset_index()
   A  C  B
0  1  1  2
1  1  2  2
2  1  3  2
>>> df.groupby(["A","C"],as_index=False).sum()
      A   C   B
A C            
1 1 NaN NaN NaN
  2 NaN NaN NaN
  3 NaN NaN NaN


jreback commented May 17, 2016

Yeah, I think this is reindexing somewhere inside and is probably not setting it up right. Pull requests welcome.

@jreback jreback added this to the Next Major Release milestone May 17, 2016

pijucha commented Jun 6, 2016

Looks quite easy to fix. The function _reindex_output() doesn't take the variable self.as_index into account.

There's another issue in the same function: the MultiIndex loses dtype information. For example:

df = pd.DataFrame({'cat': pd.Categorical([5,6,6,7,7], [5,6,7,8]),
                  'i1' : [10, 11, 11, 10, 11],
                  'i2' : [101,102,102,102,103]})

df.groupby(['cat', 'i1']).sum().reset_index().dtypes
Out[12]: 
cat      int64
i1       int64
i2     float64
dtype: object

While for a usual one level index:

df.groupby(['cat']).sum().reset_index().dtypes
Out[13]: 
cat    category
i1      float64
i2      float64
dtype: object

And I guess df.groupby(..., as_index=False).agg(...) should be consistent with df.groupby(..., as_index=True).agg(...).reset_index().
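A minimal sketch of that expected equivalence, reusing dsm054's frame from above (once the bug is fixed, both paths should produce the same rows):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [2, 2, 2], "C": pd.Categorical([1, 2, 3])})

# These two should agree: as_index=False should behave like reset_index()
left = df.groupby(["A", "C"], as_index=False).sum()
right = df.groupby(["A", "C"]).sum().reset_index()
print(left)
print(right)
```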

Edit: On second thought, I'd rather leave the index as it is. If a change is needed, it'd better be done in the MultiIndex constructor, I suppose.

I'll prepare a PR for it later.


BTW, I couldn't find any info on whether the following behaviour of categoricals in a DataFrame is by design or just a side effect:

# df - same as above
df.sum()
Out[14]: 
cat     31.0
i1      53.0
i2     510.0
dtype: float64

df[['cat']].sum()
Out[15]: 
cat    31
dtype: int64

# while for Series:
df['cat'].sum()
...
TypeError: Categorical cannot perform the operation sum

Shouldn't categoricals rather be excluded from aggregation, as datetime columns are?
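If exclusion is the desired behaviour, a caller can already get it manually with built-in pandas API (a sketch, dropping categorical columns before aggregating):

```python
import pandas as pd

df = pd.DataFrame({'cat': pd.Categorical([5, 6, 6, 7, 7], [5, 6, 7, 8]),
                   'i1': [10, 11, 11, 10, 11],
                   'i2': [101, 102, 102, 102, 103]})

# Drop categorical columns before aggregating
result = df.select_dtypes(exclude=['category']).sum()
print(result)
```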
