BUG: groupby upon categorical and sort=False triggers ValueError #13179

mpschr · 2016-05-14T18:30:12Z

Code that triggers ValueError

The combination of sort=False and a missing category in the data causes the bug - see below

First off, see this notebook which showcases the bug nicely: github.com/mpschr/pandas_missing_cat_bug

import random
import numpy as np
import pandas as pd

random.seed(88)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
chromosomes = [str(x) for x in range(1,23)] + ["X","Y"]
df.insert(0, 'chromosomes', sorted([random.choice(chromosomes) for x in range(100)]))
# astype signature as accepted by pandas 0.18.1 (the version reported below)
df.chromosomes = df.chromosomes.astype('category', categories=chromosomes, ordered=True)

for c, g in df.query("chromosomes != '1'").groupby('chromosomes', sort=False):
    print(c, g.chromosomes.cat.categories, g.shape)


/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/groupby.py in __init__(self, index, grouper, obj, name, level, sort, in_axis)
   2181                     cat = self.grouper.unique()
   2182                     self.grouper = self.grouper.reorder_categories(
-> 2183                         cat.categories)
   2184 
   2185                 # we make a CategoricalIndex out of the cat grouper

/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/categorical.py in reorder_categories(self, new_categories, ordered, inplace)
    756         """
    757         if set(self._categories) != set(new_categories):
--> 758             raise ValueError("items in new_categories are not the same as in "
    759                              "old categories")
    760         return self.set_categories(new_categories, ordered=ordered,

ValueError: items in new_categories are not the same as in old categories

Summaries of the scenarios where this bug appears:

Bug scenarios with ordered categories:

Default (sort = True): No error
chromosome 1 filtered out and sort=True: No error
chromosome 1 filtered out and sort=False: Error
sort = False: Error

Bug scenarios without ordered categories:

Default (sort = True): No error
chromosome 1 filtered out and sort=True: No error
sort = False: No error
chromosome 1 filtered out and sort=False: Error
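
The eight combinations summarised above could be checked with a loop along these lines. This is only a sketch: `groupby_status` is a hypothetical helper, the data is synthetic rather than the random frame from the report, and `observed=False` is an assumption that needs a pandas release newer than the 0.18.1 used here (it pins the behaviour where unused categories are kept).

```python
import itertools
import pandas as pd

chromosomes = [str(x) for x in range(1, 23)] + ["X", "Y"]
# Synthetic data: every chromosome twice, in string-sorted order as in the report.
values = sorted(chromosomes * 2)

def groupby_status(ordered, drop_first, sort):
    """Hypothetical helper: iterate the groupby and report whether it raised."""
    cat = pd.Categorical(values, categories=chromosomes, ordered=ordered)
    df = pd.DataFrame({"chromosomes": cat, "A": range(len(values))})
    if drop_first:
        # Leaves '1' as a declared but unused category.
        df = df[df["chromosomes"] != "1"]
    try:
        for _name, _group in df.groupby("chromosomes", sort=sort, observed=False):
            pass
    except ValueError as exc:
        return "Error: {}".format(exc)
    return "No error"

# Keys are (ordered, drop_first, sort) tuples.
results = {
    combo: groupby_status(*combo)
    for combo in itertools.product([True, False], repeat=3)
}
for (ordered, drop_first, sort), status in sorted(results.items()):
    print("ordered={} filtered={} sort={}: {}".format(ordered, drop_first, sort, status))
```

On 0.18.1 the rows marked "Error" in the lists above are the ones that would be expected to raise.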

Expected Output

Not an error, but this:


1 Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
       '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'],
      dtype='object') (7, 5)

output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.22
numpy: 1.10.4
scipy: 0.16.0
statsmodels: 0.6.0.dev-9ce1605
xarray: None
IPython: 4.1.2
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.36.0
pandas_datareader: None


mpschr · 2016-05-14T18:32:40Z

I cannot be 100% sure, but this bug seems closely related to #10505 and #10508.

jreback · 2016-05-14T19:13:42Z

So the purpose of the .unique() here is to put the categoricals in the order of appearance, BUT, crucially unused categories are removed (and that's the error that's popping up).
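
A small sketch of that failure mode, outside of groupby. It assumes `remove_unused_categories()` as a stand-in for what `.unique()` handed back in 0.18.1 (only the used categories); the three-letter data is made up for illustration.

```python
import pandas as pd

# Category 'a' is declared but never used, like chromosome '1' after the query.
cat = pd.Categorical(["c", "b", "c"], categories=["a", "b", "c"], ordered=True)

# Stand-in for the categories of cat.unique() in 0.18.1: used categories only.
used_only = cat.remove_unused_categories().categories

try:
    # reorder_categories requires exactly the same set of categories.
    cat.reorder_categories(used_only)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Because 'a' is missing from `used_only`, the set comparison inside `reorder_categories` fails, which is the check shown in the traceback.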

So in this case you are removing the values for category '1', BUT that category should still show up in the results as it's a categorical.

For sort=True, the result is already sorted in the order of the categories.

In [7]: df.query("chromosomes != '1'").groupby('chromosomes').A.sum()
Out[7]: 
chromosomes
1       NaN
2     157.0
3     115.0
4     477.0
5     274.0
6     172.0
7     221.0
8     290.0
9     231.0
10    434.0
11    196.0
12    243.0
13    109.0
14    217.0
15     89.0
16    193.0
17    417.0
18     58.0
19    149.0
20    144.0
21    166.0
22    334.0
X     147.0
Y     316.0
Name: A, dtype: float64

I suppose for sort=False you can then put the NA groups at the front or back (e.g. the '1' group); the remainder will then be in the order of appearance (e.g. the unique).

I think I would just do this in groupby (or maybe add a kwarg to .unique to return all of the categories, even unused ones; maybe we should do that by default?). Not really sure why we are excluding unused ones.
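
The ordering suggested here (observed categories in order of appearance, unused ones at the back) might look roughly like this. `appearance_then_unused` is a hypothetical helper written for illustration, not the code that ended up in groupby.

```python
import pandas as pd

def appearance_then_unused(cat):
    """Hypothetical helper: observed categories by first appearance, unused last."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    seen = list(dict.fromkeys(v for v in cat if not pd.isna(v)))
    unused = [c for c in cat.categories if c not in seen]
    return seen + unused

cat = pd.Categorical(["c", "b", "c"], categories=["a", "b", "c"], ordered=True)
order = appearance_then_unused(cat)

# Same set of categories as before, so reorder_categories accepts it.
reordered = cat.reorder_categories(order)
print(list(reordered.categories))
```

Since the new order contains every original category, the set check in `reorder_categories` passes and the values themselves are untouched.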

cc @jorisvandenbossche
cc @JanSchulz

jreback · 2016-05-14T19:14:09Z

cc @sinhrks

mpschr · 2016-05-15T09:04:00Z

Hi @jreback - Thanks for looking at the bug report. I have just a little doubt as a layman here: is it convention to return a 'group' (in the groupby) for all the categories even though there is no data for them in the supplied data?

Imagine I make a query for just chromosomes 4 and 5 for whatever (biological investigative) reason - I would not expect results back for the other chromosomes I think (as follows):

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum()

chromosomes
4     195.0
5     394.0
Name: A, dtype: float64


# as opposed to :

chromosomes
1       NaN
2       NaN
3       NaN
4     195.0
5     394.0
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19      NaN
20      NaN
21      NaN
22      NaN
X       NaN
Y       NaN
Name: A, dtype: float64

jreback · 2016-05-15T13:58:40Z

@mpschr yes, this is the purpose of Categoricals: to return the full set of categories. You made an explicit choice to use them, and so it must be explicit to drop them; that is the point here.

If you think about it, it would be buggy to remove them! IOW, how would the code know it's 'ok' to drop them?

mpschr · 2016-05-15T16:17:10Z

Hi @jreback, I am not sure we are talking about the same thing. To elaborate: I was referring to the data available in the DataFrame. Of course the categories which have been established in df.chromosomes.cat.categories should never be dropped, even though they are not represented in the DataFrame. Exactly as shown here:

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes
71    4
72    4
73    4
74    5
75    5
76    5
77    5
78    5
79    5
80    5
81    5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

But, analogously to this I would expect the following output after doing groupby:

df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum().reset_index().chromosomes

#expected output:
chromosomes
1      4
2      5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

#but actual output is.

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
20    21
21    22
22     X
23     Y
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

The actual output (2nd option) we get here is misleading, since none of the chromosomes except 4 and 5 are in the supplied data; they are just 'acceptable' options. Is it possible that these two different viewpoints contribute to the bug reported here?

jankatins · 2016-05-16T16:24:45Z

@mpschr:
There is a different "view" for categoricals and groupby: if I have a Likert scale and want to get the number of times each value was ticked, I want "unused" groups to show up as 0. That was at least the idea behind having all groups show up in groupby and such things.
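
A minimal sketch of that Likert-scale case. The scale labels and answers are invented, and the `observed` keyword is an assumption in the sense that it only exists in pandas releases later than the 0.18.1 discussed in this issue.

```python
import pandas as pd

scale = ["disagree", "neutral", "agree"]
# Nobody ticked "neutral", so it is a declared but unused category.
answers = pd.Categorical(["agree", "agree", "disagree"], categories=scale, ordered=True)
df = pd.DataFrame({"answer": answers})

# observed=False keeps a group for every declared category.
all_levels = df.groupby("answer", observed=False).size()
# observed=True keeps only the categories present in the data.
used_levels = df.groupby("answer", observed=True).size()

print(all_levels)
print(used_levels)
```

With `observed=False` the unticked level appears with a count of 0, which is the view described here; `observed=True` gives the view where only supplied data produces groups.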

> I think would just do this in groupby (or maybe add a kw arg to .unique to return all of the categories, even unused ones; maybe we should do that by default)? Not really sure why we are excluding unused ones.

I think there was a specific reason why unique no longer returns the whole set of categories (AFAIK the first implementation simply returned the categories). I think it was because someone argued that the implicit API contract for unique is that it returns only used values (ordered by appearance, as that was what seaborn/plots expected).

mpschr · 2016-05-17T07:13:27Z

Ok, so this is the current behaviour:

# 1.

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (2, object): [4 < 5]

Again, what a layman like me would expect here is the following:

# 2.

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

Now I understand the bug :) The seaborn library should be able to work with the unique used values as in example 2, right?

jreback · 2016-05-17T14:00:20Z

@mpschr not sure what you mean. This is as expected. The point is that the category dtype IS propagated to ALL operations. There is extensive documentation on this. What exactly is not clear? (The bug in this issue is independent of / not related to this.)

In [8]: df[df.chromosomes.isin(query_chroms)].chromosomes
Out[8]: 
61    4
62    4
63    4
64    4
65    4
66    4
67    4
68    4
69    5
70    5
71    5
72    5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

mpschr · 2016-05-17T14:10:57Z

Yep @jreback - I think I went a bit off-topic with the groupby behaviour (including unused categories in the output of group aggregations).

In any case I totally agree with you on the matter of the unique behavior, as posted in my last comment. The unused categories should not be discarded from the cat.categories when getting df.chromosomes.unique()
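
Whether `.unique()` itself keeps the unused categories has differed between pandas releases (the output above shows 0.18.1 dropping them), so the sketch below avoids relying on it. It assumes `drop_duplicates()` as one way to get the distinct used values in order of appearance while keeping all 24 categories; the four-value Series is made up.

```python
import pandas as pd

chromosomes = [str(x) for x in range(1, 23)] + ["X", "Y"]
s = pd.Series(
    pd.Categorical(["4", "4", "5", "4"], categories=chromosomes, ordered=True)
)

# Keeps the first occurrence of each value; the category dtype is unchanged.
distinct = s.drop_duplicates()

print(list(distinct))
print(len(distinct.cat.categories))
```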

jreback · 2016-05-17T15:04:41Z

@JanSchulz

> I think would just do this in groupby (or maybe add a kw arg to .unique to return all of the categories, even unused ones; maybe we should do that by default)? Not really sure why we are excluding unused ones.

> I think there was a specific reason why unique is now not returning the whole categories (AFAIK the first implementation simply returned the categories). I think because someone argued that the implicit API contract for unique is that it returns only used values (ordered by appearance, as that was what seaborn/plots expected).

Yeah, I don't really recall all of the discussion about .unique (though there were many!).

Yeah, I can see how we just return the observed values.

mpschr · 2017-02-01T14:24:54Z

Has this not been fixed yet? (Just curious.)

jreback · 2017-02-01T14:28:22Z

@mpschr issues get closed when they are fixed. You are welcome to submit a PR to fix this. Community PRs push things along.

closes pandas-dev#13179

Author: Kernc <kerncece@gmail.com>

Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits:

55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
c813146 [Kernc] PERF: add asv for categorical grouping
0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing

mpschr mentioned this issue May 14, 2016

Fix genes not split correctly for gainloss report etal/cnvkit#108

Closed

jreback added Bug Groupby Categorical Categorical Data Type Difficulty Intermediate labels May 14, 2016

jreback added this to the 0.18.2 milestone May 14, 2016

jreback modified the milestones: 0.19.0, Next Major Release Sep 28, 2016

kernc mentioned this issue Feb 17, 2017

BUG: Categorical.unique() preserves categories #15439

Closed

4 tasks

jreback closed this as completed in f638550 Feb 22, 2017

jreback modified the milestones: 0.20.0, Next Major Release Feb 22, 2017
