BUG: groupby upon categorical and sort=False triggers ValueError #13179

Closed
mpschr opened this issue May 14, 2016 · 13 comments
Labels
Bug Categorical Categorical Data Type Groupby

@mpschr

mpschr commented May 14, 2016

Code that triggers ValueError

The combination of sort=False and a missing category in the data causes the bug - see below

First off, see this notebook which showcases the bug nicely: github.com/mpschr/pandas_missing_cat_bug

import random

import numpy as np
import pandas as pd

random.seed(88)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
chromosomes = [str(x) for x in range(1, 23)] + ["X", "Y"]
df.insert(0, 'chromosomes', sorted([random.choice(chromosomes) for x in range(100)]))
# pandas 0.18 API; later releases take a CategoricalDtype instead
df.chromosomes = df.chromosomes.astype('category', categories=chromosomes, ordered=True)

# Raises ValueError: category '1' is still declared but no longer present in the data
for c, g in df.query("chromosomes != '1'").groupby('chromosomes', sort=False):
    print(c, g.chromosomes.cat.categories, g.shape)


/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/groupby.py in __init__(self, index, grouper, obj, name, level, sort, in_axis)
   2181                     cat = self.grouper.unique()
   2182                     self.grouper = self.grouper.reorder_categories(
-> 2183                         cat.categories)
   2184 
   2185                 # we make a CategoricalIndex out of the cat grouper

/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/categorical.py in reorder_categories(self, new_categories, ordered, inplace)
    756         """
    757         if set(self._categories) != set(new_categories):
--> 758             raise ValueError("items in new_categories are not the same as in "
    759                              "old categories")
    760         return self.set_categories(new_categories, ordered=ordered,

ValueError: items in new_categories are not the same as in old categories

Summary of the scenarios in which this bug appears:

Bug scenarios with ordered categories:

  • Default (sort = True): No error
  • chromosome 1 filtered out and sort=True: No error
  • chromosome 1 filtered out and sort=False: Error
  • sort = False: Error

Bug scenarios without ordered categories:

  • Default (sort = True): No error
  • chromosome 1 filtered out and sort=True: No error
  • sort = False: No error
  • chromosome 1 filtered out and sort=False: Error

Expected Output

Not an error, but this:


1 Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
       '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'],
      dtype='object') (7, 5)

output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.22
numpy: 1.10.4
scipy: 0.16.0
statsmodels: 0.6.0.dev-9ce1605
xarray: None
IPython: 4.1.2
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.36.0
pandas_datareader: None

@mpschr
Author

mpschr commented May 14, 2016

I cannot be 100% sure, but this bug seems to be closely related to #10505 and #10508.

@jreback
Contributor

jreback commented May 14, 2016

So the purpose of the .unique() here is to put the categories in order of appearance, BUT, crucially, unused categories are removed (and that's the error that's popping up).

So in this case you are removing the values for category '1', BUT that category should still show up in the results, as it's a categorical.
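The failure mode described above can be reproduced in isolation. This is only a sketch on made-up data: `remove_unused_categories()` stands in for the 0.18-era `.unique()` (whose handling of unused categories has varied between pandas versions), and the variable names are illustrative.

```python
import pandas as pd

# Three declared categories, but only "2" and "3" actually occur in the values.
cat = pd.Categorical(["3", "2", "3"], categories=["1", "2", "3"], ordered=True)

# Stand-in for what unique() returned in this thread: observed categories only.
observed = cat.remove_unused_categories().categories

# reorder_categories() requires exactly the same set of categories,
# so passing only the observed ones raises the ValueError from the traceback.
try:
    cat.reorder_categories(observed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because `reorder_categories` only permutes the existing categories, any code path that feeds it a strict subset will fail in the same way.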

For sort=True, the result is already sorted in the order of the categories:

In [7]: df.query("chromosomes != '1'").groupby('chromosomes').A.sum()
Out[7]: 
chromosomes
1       NaN
2     157.0
3     115.0
4     477.0
5     274.0
6     172.0
7     221.0
8     290.0
9     231.0
10    434.0
11    196.0
12    243.0
13    109.0
14    217.0
15     89.0
16    193.0
17    417.0
18     58.0
19    149.0
20    144.0
21    166.0
22    334.0
X     147.0
Y     316.0
Name: A, dtype: float64

I suppose for sort=False you can then put the NA groups (e.g. the '1' group) at the front or back; the remainder will then be in order of appearance (i.e. the unique values).

I think I would just do this in groupby (or maybe add a kwarg to .unique to return all of the categories, even unused ones; maybe we should do that by default?). Not really sure why we are excluding unused ones.
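The ordering suggested above for sort=False (observed categories in order of appearance, unobserved ones at the back) could be sketched as follows. `reorder_by_appearance` is a hypothetical helper written for illustration, not the code that ended up in pandas.

```python
import pandas as pd


def reorder_by_appearance(cat):
    """Hypothetical helper: observed categories first (in order of
    appearance), unobserved categories appended at the back."""
    # Observed values, de-duplicated in order of appearance (NaN dropped).
    observed = list(pd.Series(cat).dropna().astype(object).unique())
    # Declared categories that never occur in the data.
    unobserved = [c for c in cat.categories if c not in observed]
    # The full set of categories is preserved, so this does not raise.
    return cat.reorder_categories(observed + unobserved)


cat = pd.Categorical(["5", "4", "5"], categories=["1", "4", "5"], ordered=True)
reordered = reorder_by_appearance(cat)
print(list(reordered.categories))
```

Since every declared category is still passed to `reorder_categories`, the set check that triggers the ValueError is satisfied.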

cc @jorisvandenbossche
cc @JanSchulz

@jreback jreback added this to the 0.18.2 milestone May 14, 2016
@jreback
Contributor

jreback commented May 14, 2016

cc @sinhrks

@mpschr
Author

mpschr commented May 15, 2016

Hi @jreback - Thanks for picking up the bug report. I have just a small doubt as a layman here: is it convention to return a 'group' (in the groupby) for all the categories even though no data for them is available in the supplied data?

Imagine I make a query for just chromosomes 4 and 5 for whatever (biological investigative) reason - I would not expect results back for the other chromosomes I think (as follows):

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum()

chromosomes
4     195.0
5     394.0
Name: A, dtype: float64


# as opposed to :

chromosomes
1       NaN
2       NaN
3       NaN
4     195.0
5     394.0
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19      NaN
20      NaN
21      NaN
22      NaN
X       NaN
Y       NaN
Name: A, dtype: float64

@jreback
Contributor

jreback commented May 15, 2016

@mpschr yes, this is the purpose of Categoricals: to return the full set of categories. You made an explicit choice to use them, so dropping them must be explicit too; that is the point here.

If you think about it, it would be buggy to remove them! IOW, how would the code know it's 'ok' to drop them?
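Dropping the unused categories explicitly, as described above, might look like the sketch below. The data is made up for illustration, and `observed=False` is passed only to pin down the behaviour on newer pandas releases; that keyword does not exist in the 0.18.1 used in this thread.

```python
import pandas as pd

chromosomes = ["1", "2", "3", "4", "5"]
df = pd.DataFrame({
    "chromosomes": pd.Categorical(["4", "4", "5"], categories=chromosomes, ordered=True),
    "A": [10, 20, 30],
})

# Every declared category becomes a group, even those without data.
full = df.groupby("chromosomes", observed=False)["A"].sum()

# Explicit opt-out: strip unused categories before grouping.
trimmed_df = df.assign(chromosomes=df["chromosomes"].cat.remove_unused_categories())
trimmed = trimmed_df.groupby("chromosomes", observed=False)["A"].sum()

print(list(full.index))
print(list(trimmed.index))
```

The second result only contains groups for '4' and '5' because the categorical itself no longer declares the other chromosomes.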

@mpschr
Author

mpschr commented May 15, 2016

Hi @jreback, I am not sure we are talking about the same thing. To elaborate: I was referring to the data available in the DataFrame. Of course the categories which have been established in df.chromosomes.cat.categories should never be dropped, even though they are not represented in the DataFrame. Exactly as shown here:

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes
71    4
72    4
73    4
74    5
75    5
76    5
77    5
78    5
79    5
80    5
81    5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

But, analogously to this I would expect the following output after doing groupby:

df[df.chromosomes.isin(query_chroms)].groupby('chromosomes').A.sum().reset_index().chromosomes

#expected output:
chromosomes
1      4
2      5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

#but actual output is.

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
19    20
20    21
21    22
22     X
23     Y
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

The actual output (2nd option) is misleading, since none of the chromosomes except 4 and 5 are in the supplied data; they are just 'acceptable' options. Is it possible that these two different viewpoints contribute to the bug reported here?

@jankatins
Contributor

@mpschr:
There is a different "view" of categoricals and groupby: if I have a Likert scale and want to get the number of times each value was ticked, I want "unused" groups to show up as 0. That was at least the idea behind having all groups show up in groupby and similar operations.
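A small sketch of that Likert-scale use case, on made-up answers. It assumes a newer pandas than the 0.18.1 in this thread, since it uses the `observed` keyword of `groupby` to make both behaviours explicit.

```python
import pandas as pd

scale = ["disagree", "neutral", "agree"]
df = pd.DataFrame({
    "answer": pd.Categorical(["agree", "agree", "disagree"], categories=scale, ordered=True),
})

# All declared categories become groups; nobody ticked "neutral", so it counts 0.
counts_all = df.groupby("answer", observed=False).size()

# Only categories that actually occur in the data.
counts_observed = df.groupby("answer", observed=True).size()

print(counts_all.tolist())
print(list(counts_observed.index))
```

With `observed=False` the "neutral" row is present with a count of 0, which is the behaviour argued for above; `observed=True` gives the view mpschr expected.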

I think I would just do this in groupby (or maybe add a kwarg to .unique to return all of the categories, even unused ones; maybe we should do that by default?). Not really sure why we are excluding unused ones.

I think there was a specific reason why unique no longer returns the whole set of categories (AFAIR the first implementation simply returned the categories): someone argued that the implicit API contract for unique is that it returns only used values (ordered by appearance, as that was what seaborn/plots expected).

@mpschr
Author

mpschr commented May 17, 2016

Ok, so this is the current behaviour:

# 1.

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (2, object): [4 < 5]

Again, what a layman like me would expect here is the following:

# 2.

query_chroms = ['4', '5']
df[df.chromosomes.isin(query_chroms)].chromosomes.unique()
#output
[4, 5]
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

Now I understand the bug :) The seaborn library should be able to work with the unique used values as in example 2, right?
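For what it's worth, a result shaped like example 2 (only the used values, but with the full set of categories kept) can be obtained without relying on `.unique()`, whose treatment of unused categories has differed between pandas versions. The sketch below uses `Series.drop_duplicates()` on made-up data and assumes it preserves the categorical dtype, which is my understanding of its behaviour.

```python
import pandas as pd

chromosomes = ["1", "4", "5", "X"]
s = pd.Series(pd.Categorical(["4", "4", "5", "5"], categories=chromosomes, ordered=True))

# First occurrence of each value; the dtype (and its categories) is carried over.
used = s.drop_duplicates()

print(list(used))
print(list(used.cat.categories))
```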

@jreback
Contributor

jreback commented May 17, 2016

@mpschr not sure what you mean. This is as expected. The point is that the category dtype IS propagated to ALL operations. There is extensive documentation on this. What exactly is not clear? (The bug in this issue is independent of / not related to this.)

In [8]: df[df.chromosomes.isin(query_chroms)].chromosomes
Out[8]: 
61    4
62    4
63    4
64    4
65    4
66    4
67    4
68    4
69    5
70    5
71    5
72    5
Name: chromosomes, dtype: category
Categories (24, object): [1 < 2 < 3 < 4 ... 21 < 22 < X < Y]

@mpschr
Author

mpschr commented May 17, 2016

Yep @jreback - I think I went a bit off-topic with the groupby behaviour (including unused categories in the output of group aggregations).

In any case, I totally agree with you on the unique behavior, as posted in my last comment. The unused categories should not be discarded from cat.categories when getting df.chromosomes.unique().

@jreback
Contributor

jreback commented May 17, 2016

@JanSchulz

I think I would just do this in groupby (or maybe add a kwarg to .unique to return all of the categories, even unused ones; maybe we should do that by default?). Not really sure why we are excluding unused ones.

I think there was a specific reason why unique no longer returns the whole set of categories (AFAIR the first implementation simply returned the categories): someone argued that the implicit API contract for unique is that it returns only used values (ordered by appearance, as that was what seaborn/plots expected).

Yeah, I don't really recall all of the discussion about .unique (though there was a lot of it!).

Yeah, I can see why we just return the observed values.

@jreback jreback modified the milestones: 0.19.0, Next Major Release Sep 28, 2016
@mpschr
Author

mpschr commented Feb 1, 2017

Has this not been fixed yet? (Just curious.)

@jreback
Contributor

jreback commented Feb 1, 2017

@mpschr issues get closed when they are fixed. You are welcome to submit a PR to fix this. Community PRs push things along.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Feb 22, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#13179

Author: Kernc <kerncece@gmail.com>

Closes pandas-dev#15439 from kernc/Categorical.unique-nostrip-unused and squashes the following commits:

55733b8 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
2aec326 [Kernc] fixup! BUG: Fix .groupby(categorical, sort=False) failing
c813146 [Kernc] PERF: add asv for categorical grouping
0c550e6 [Kernc] BUG: Fix .groupby(categorical, sort=False) failing

3 participants