WIP: categoricals as an internal CategoricalBlock GH5313 #7217

jreback · 2014-05-23T13:13:36Z

This PR creates a CategoricalBlock as a first-class internal object, on par with blocks like Datetime, Timedelta, Object, Numeric, etc.

closes #5313
closes #5314
closes #3943

TODOS

test with select_dtypes ENH: select_dypes impl #7434, jreback@a43c6c0

Code Changes

Documentation

Add release notes
Add an API change note about the old constructor mode
Document the Categorical methods
Link classes/methods in categorical.rst -> not done, just a link to API docs and in special cases.
Add Categorical API to API docs

Future

HDF support? -> see above... ENH: Categorical serialized #7621
meta infos / "Contrasts": R's factor has a "contrast" property which indicates how this factor should be used in statistical models -> don't do it in the PR

jreback · 2014-05-23T14:35:20Z

cc @immerrr

jreback · 2014-05-23T14:58:37Z

pandas/core/common.py

+        if isinstance(other, compat.string_types):
+            return other == self.name
+        else:
+            return other == self.base


cc @immerrr if you have any ideas on how to make this more 'real', lmk. (e.g. this simply is not a sub-class of np.dtype, nor do I think you can actually make one).

Well, you can, but it's not trivial, see quaternion which is a "canonical" example of dtype subclass.

On a second thought, quaternion is not a sub class, rather a new class, so you may be right about subclassing.

yeah... that looked complicated. and do I really need the sub-class? what I have does work (it's more for display than anything)

There are a lot of bells and whistles in that example, that's true. A minimal viable new dtype can be found here.
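
A minimal sketch of the comparison logic in the diff above, assuming a plain Python class rather than a real `np.dtype` subclass. The class name `CategoryDtypeSketch` and the use of `object` as `base` are illustrative assumptions, not the code in this PR; the original checks `compat.string_types`, which is simplified to `str` here.

```python
class CategoryDtypeSketch:
    """Dtype-like placeholder; deliberately NOT a subclass of np.dtype."""

    name = "category"
    base = object  # assumption: stands in for the underlying object dtype

    def __eq__(self, other):
        # Strings compare against the dtype name, e.g. dtype == "category".
        if isinstance(other, str):
            return other == self.name
        # Anything else is compared against the base, as in the diff.
        return other == self.base

    def __hash__(self):
        # Defining __eq__ drops the default __hash__, so restore one.
        return hash(self.name)


dtype = CategoryDtypeSketch()
print(dtype == "category")  # True
print(dtype == "int64")     # False
print(dtype == object)      # True
```

Comparing by name is enough for display and for string-based dtype checks, which is the limited role described in the comment above.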

jreback · 2014-05-24T16:32:27Z

@jseabold feel free to jump in here for commentary

jankatins · 2014-05-25T22:18:12Z

@jreback See https://github.com/JanSchulz/pandas/tree/categorical_improvements

new way to build a Categorical: Categorical(["a","b"], levels=["b","a"])
"ordered" keyword
- implement sort/min/max in Categorical
- sort in Series(Categorical(...))

What doesn't work yet:

min/max doesn't yet work as a Series(Categorical(...)) -> not sure yet what I have to implement to get this working
labels
assigning new levels (needs some reordering, ...)
performance improvements?

jreback · 2014-05-25T22:24:54Z

ok will have a look

why did you change labels -> _values_pointer
that is not common nomenclature afaict

on the surface this seems to break back compat is that true?

jankatins · 2014-05-25T22:33:33Z

The problem with the "old" labels was that they were "pointers" and labels in R have a different meaning:

R: labels are other names for levels
old pd.Categorical: labels are pointer to levels

pseudo code for future label usage, which is IMO compatible with R

a = Categorical([1,2,3,1], levels=[1,2,3])
a.labels = ["x","y","z"]
a.__array__() == ["x","y","z","x"]
a.labels = ["low","medium","high"]
a.__array__() == ["low","medium","high","low"]
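
The distinction between "pointers" and R-style labels can be sketched in plain Python. This is only an illustration under the assumption that the pointers are integer positions into the levels list; `decode` is a hypothetical helper, not part of any pandas API.

```python
levels = [1, 2, 3]
values = [1, 2, 3, 1]

# "Pointers" (what the old Categorical called labels): positions into levels.
codes = [levels.index(v) for v in values]


def decode(codes, names):
    """Map each pointer to the name stored at that position."""
    return [names[c] for c in codes]


print(codes)                                      # [0, 1, 2, 0]
print(decode(codes, levels))                      # [1, 2, 3, 1]
print(decode(codes, ["x", "y", "z"]))             # ['x', 'y', 'z', 'x']
print(decode(codes, ["low", "medium", "high"]))   # ['low', 'medium', 'high', 'low']
```

Renaming in the R sense only swaps the list of names; the pointers themselves never change.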

jankatins · 2014-05-25T22:40:18Z

One question: how does one get to the underlying data structure from a Series?

s = Series(Categorical(...)) -> how do I get the levels/labels from s

That would be the most interesting feature from ggplot's side: we want to get the levels/labels from the dataframe which is passed to the plotting functions

jreback · 2014-05-25T22:40:34Z

ok so labels are descriptors of the levels (if provided)
and value_pointers are what we call labels now

hmm

what if we call values_pointers:

locs or index?

jreback · 2014-05-25T22:41:37Z

Series(Categorical(....)).values returns the Categorical itself

jreback · 2014-05-25T22:43:15Z

however we could define labels/levels as a property on Series that only works on categorical series (I think), or a method if u need to pass in arguments
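
A rough sketch of that idea, assuming a property that simply forwards to the wrapped Categorical and refuses everything else. `SeriesSketch` and `CategoricalSketch` are hypothetical stand-ins, and raising `TypeError` for non-categorical data is an assumption made for this example only.

```python
class CategoricalSketch:
    """Hypothetical stand-in: levels plus integer positions into them."""

    def __init__(self, values, levels):
        self.levels = list(levels)
        self.codes = [self.levels.index(v) for v in values]


class SeriesSketch:
    """Hypothetical stand-in for a Series holding arbitrary values."""

    def __init__(self, values):
        # .values hands back the wrapped object unchanged.
        self.values = values

    @property
    def levels(self):
        # Only meaningful when the underlying data is categorical.
        if not isinstance(self.values, CategoricalSketch):
            raise TypeError("levels is only available for categorical data")
        return self.values.levels


s = SeriesSketch(CategoricalSketch(["a", "b", "a"], levels=["b", "a"]))
print(s.levels)        # ['b', 'a']
print(s.values.codes)  # [1, 0, 1]
```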

jankatins · 2014-05-25T22:44:45Z

I don't mind the names, I just think the "pointers" are an internal implementation detail and shouldn't show up in tab completion.

jankatins · 2014-05-25T22:46:05Z

I'm off to bed, will be back tomorrow evening...

jreback · 2014-06-03T14:32:16Z

@JanSchulz do you have any idea what the purpose of factor_agg/group_agg in core/frame.py is? I don't see them used anywhere but in a couple of tests. Maybe they existed before the groupby code handled Categoricals (which is what this looks like it's doing). Or am I missing something?

jreback · 2014-06-03T19:27:10Z

@JanSchulz

ok I have a new version which incorporates your changes

delegates reduction ops from Series to the Categorical itself (so these end up calling the min/max you defined)
other reduction ops (var, mean, sum, etc.) raise TypeError
numeric ops raise TypeError (e.g. cat1 + cat2)
sorting was off a bit; it now works on copies, sort is inplace, and order returns a copy (just like Series)
I changed _value_pointer back to labels for consistency with the rest of the library. I know you don't like it, but it's 'internal' anyhow.
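
The behaviour listed above can be sketched as follows. This assumes min/max follow the order of the levels rather than the natural order of the values; `OrderedCategoricalSketch` and its error messages are hypothetical and are not the PR's implementation.

```python
class OrderedCategoricalSketch:
    """Hypothetical stand-in: min/max use level order, numeric ops fail."""

    def __init__(self, values, levels):
        self.levels = list(levels)
        self.codes = [self.levels.index(v) for v in values]

    def min(self):
        # The smallest code points at the first level that occurs.
        return self.levels[min(self.codes)]

    def max(self):
        return self.levels[max(self.codes)]

    def sum(self):
        # Reductions other than min/max make no sense for categories.
        raise TypeError("cannot perform sum on categorical data")

    def __add__(self, other):
        # Numeric ops such as cat1 + cat2 are rejected as well.
        raise TypeError("cannot add categorical data")


cat = OrderedCategoricalSketch(["b", "a", "b"], levels=["b", "a"])
print(cat.min())  # b
print(cat.max())  # a
```

Note that `min()` returns `"b"` here because `"b"` is the first level, even though it sorts after `"a"` as a string.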

what's still missing?

jankatins · 2014-06-04T09:11:50Z

After a bit more poking at R's factors, I think labels there are kind of a waste of time: I think changing levels in R will do renaming/reducing/appending of levels. It seems labels are simply there to change the names of the levels in one go during creation of a factor, but that can also be done afterwards by changing the levels. So I'm fine with that change back from pointers to labels, although I still find it strange, as from my (German, so non-English) POV labels are "names for something", so actually I would have expected the other end to be the labels ("pointers point to labels").

In R, a categorical is always strings (so the first thing would be to convert values to strings), but I think we can skip that.

Anyway, I think the main part missing is an API which reorders/reduces/expands levels based on an input array/list, plus renaming of levels:

c = Series(Categorical([1,2,3,4,1], levels=[1,2,3,4]))
# all following examples would operate on this one...
c.labels == [0,1,2,3,0]
c.levels == [1,2,3,4]
c.get_values() == [1,2,3,4,1] 
# reorder would basically replace the levels and do a replace on the pointers/labels so that they point to the same 'levels' (or in my speak 'labels')
c.reorder([4,3,2,1]) # all "pointers" to '4' must be changed from 3 to 0,...
c.labels == [3,2,1,0,3] # positions are changed
c.levels == [4,3,2,1] # levels are now in new order
c.get_values() == [1,2,3,4,1] # output is the same
c.min() == 4
c.max() == 1
# assigning to levels would simply exchange the levels array
c.levels = [4,3,2,1] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2,1] 
c.get_values() == [4,3,2,1,4] 
c.min() == 4
c.max() == 1
# assigning a longer array will add a level 
c.levels = [4,3,2,1,0] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2,1,0] 
c.get_values() == [4,3,2,1,4] 
c.min() == 4
c.max() == 0
c.levels = [4,3,"a",2,1] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,"a",2,1] 
c.get_values() == [4,3,"a",2,4] 
# assigning a shorter array will make those values NA
c.levels = [4,3,2] 
c.labels == [0,1,2,NA,0]
c.levels == [4,3,2] 
c.get_values() == [4,3,2,NA,4] 
# assigning a NA level could do two things: make that value NA or the level would be NA
# I would vote for the latter, as otherwise we would need to check each new level for NAs...
c.levels = [4,3,2, NA] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2, NA] 
c.get_values() == [4,3,2,NA,4]
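
The difference between reordering and plain assignment in the pseudo code above can be sketched in plain Python. This is a sketch under assumptions: `reorder` and `decode` are hypothetical helpers, `-1` is an assumed sentinel for NA, and `None` stands in for NA in decoded output.

```python
NA = -1  # assumption: sentinel code meaning "no level"


def decode(levels, codes):
    """Turn codes back into values; codes without a level become None."""
    return [levels[c] if 0 <= c < len(levels) else None for c in codes]


def reorder(levels, codes, new_levels):
    """Re-point codes at new_levels so the decoded values stay the same."""
    position = {level: i for i, level in enumerate(new_levels)}
    return [
        position.get(levels[c], NA) if 0 <= c < len(levels) else NA
        for c in codes
    ]


levels = [1, 2, 3, 4]
codes = [0, 1, 2, 3, 0]

# Reordering: codes move, values are unchanged.
reordered = reorder(levels, codes, [4, 3, 2, 1])
print(reordered)                        # [3, 2, 1, 0, 3]
print(decode([4, 3, 2, 1], reordered))  # [1, 2, 3, 4, 1]

# Plain assignment: codes stay, values change.
print(decode([4, 3, 2, 1], codes))      # [4, 3, 2, 1, 4]

# Assigning fewer levels: codes without a level decode to NA.
print(decode([4, 3, 2], codes))         # [4, 3, 2, None, 4]
```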

I also think that operations based on categorical levels (groupby, etc.) should return groups for all levels and not only the ones which have values -> empty groups for levels without values.
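
A small sketch of what "groups for all levels" could mean, using a simple count per level. `count_by_level` is a hypothetical helper written for this illustration; it is not a pandas function.

```python
def count_by_level(levels, codes):
    """Count occurrences per level, keeping levels that never occur."""
    counts = {level: 0 for level in levels}  # every level gets a group
    for c in codes:
        counts[levels[c]] += 1
    return counts


# Level "c" has no values but still shows up, with a count of zero.
print(count_by_level(["a", "b", "c"], [0, 0, 1]))  # {'a': 2, 'b': 1, 'c': 0}
```

Keeping the empty group is what makes faceting straightforward: every facet exists even when no rows fall into it.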

jorisvandenbossche · 2014-06-04T09:59:32Z

Note: I am not really following this, or a user of categorical types at the moment.

But, seeing this PR and the discussion, and as this is a really big and important new feature and addition to the pandas 'language' (not just a new method/function), I was wondering if it would be better to first lay out the design (API, naming, ...) in a kind of 'design document' (something like a PEP for pandas)?

Just from a brief look, it seems there is still some discussion about the naming of things, how certain operations should work exactly, ... This is also something that should be discussed more broadly I think (with more people who would use this, send it out to the list), but it is now difficult to engage in the discussion with only the code to look at and the discussion in this PR.

Having an overview with the reasoning behind this new type, the naming, the API, how you create it, how to do common operations on it, how it behaves in other pandas methods, some examples of applications, (maybe a short comparison with R), ... could be beneficial for making this a more solid enhancement, but also to facilitate the discussion (and afterwards this can be used as a start for some documentation, so certainly not 'wasted' in that regard).

Just a remark from the sideline. What do you think?
@JanSchulz Would you be able to start something like that?

jankatins · 2014-06-04T12:03:42Z

I think someone from the statistics side should comment on this: cc @jseabold @josef-pkt and @cancan101 @kshedden @upandacross @cfarmer because they had some issues open which showed up after a search for "categorical" in Statsmodels

I come from ggplot and plotting and I'm actually after the "reorder bar charts" (-> reorder levels) and "make faceting easier by letting empty levels show up in groupby" features (see linked issues above).

I can do a comparison with R: creation, adding to a df, reorder levels, change levels, groupby. Something else?

jankatins · 2014-06-04T12:03:47Z

What also needs to be tested is "add to a df, sort the df on another column, see that the categorical series is changed accordingly". Also selection: df[df["cat_col"] == 4] should select the right rows

jankatins · 2014-07-06T21:55:32Z

@jreback Thanks!

I will try to add a few tests and see if everything works by tomorrow afternoon. It might happen that I don't get that far, and I will be traveling until Thursday.

jreback · 2014-07-06T22:01:12Z

np

this is going to wait for merge until next week anyhow (0.14.1 should be released on Friday), so after that

jreback · 2014-07-07T00:03:29Z

@JanSchulz updated with a test for using cats and non-cats. it ends up expanding the output space to basically be non-compressed again, but will see if that is an issue.

jankatins · 2014-07-10T18:07:14Z

@jreback please cherry-pick jankatins@b96cf3c (documentation updates in basics.rst for the new select_dtypes method)

jreback · 2014-07-10T18:15:19Z

@JanSchulz done

GH3943, GH5313, GH5314, GH7444
ENH: delegate _reduction and ops from Series to the categorical to support min/max and raise TypeError on other ops (numerical) and reduction
Add Categorical Properties to Series
Default to 'ordered' Categoricals if values are ordered
Categorical: add level assignments and reordering + changed default for ordered
Add a `Categorical.reorder_levels()` method. Change some naming in `Series`, so that the methods do not clash with established standards and rename the other categorical methods accordingly. Also change the default for `ordered` to True if values + levels are passed in at creation time.
Initial doc version for working with Categorical data
Categorical: add Categorical.mode() and use that in Series.mode()
Categorical: implement remove_unused_levels()
Categorical: implement value_count() for categorical series
Categorical: make Series.astype("category") work
ENH: add setitem to Categorical
BUG: assigning to levels not in level set now raises ValueError
API: disallow numpy ufuncs with categoricals
Categorical: Categorical assignment to int/obj column
ENH: add support for fillna to Categoricals
API: deprecate old style categorical constructor usage and change default
Before it was possible to pass in precomputed labels/pointer and the corresponding levels (e.g.: `Categorical([0,1,2], levels=["a","b","c"])`). This could lead to subtle errors in case of integer categoricals: the following could be interpreted both as "precomputed pointers and levels" and as "values and levels", but converting it back to an integer array would result in different arrays:
`np.array(Categorical([1,2], levels=[1,2,3]))`
interpreted as pointers: `[2,3]`
interpreted as values: `[1,2]`
Up to now we would favour old style "pointer and levels" if these values could be interpreted as such (see code for details...).
With this commit we favour new style "values and levels" and only attempt to interpret them as "pointers and levels" if "compat=True" is passed to the constructor.
BREAKS: This will break code which uses Categoricals with "pointer and levels". A short google search and a search on stackoverflow revealed no such usage.
Categorical: document constructor changes and small fixes
Categorical: document that inappropriate numpy functions won't work anymore
ENH: concat support
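
The constructor ambiguity described in the commit message can be sketched in plain Python. The helper names `as_pointers` and `as_values` are hypothetical and only mimic the two readings; neither is pandas code.

```python
def as_pointers(data, levels):
    """Old style: each entry is a position into levels."""
    return [levels[i] for i in data]


def as_values(data, levels):
    """New style: each entry is itself a value drawn from levels."""
    return [v if v in levels else None for v in data]


data, levels = [1, 2], [1, 2, 3]
print(as_pointers(data, levels))  # [2, 3]
print(as_values(data, levels))    # [1, 2]
```

Because both readings are valid for integer input, the result depends entirely on which one the constructor picks by default, which is why the default matters.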

Doc: Add Release notes for pandas-dev#7217
DOC: update v0.15.0 notes
Categorical: .codes should be immutable
ERR: codes modification raises ValueError always
Categorical: use Categorical.from_codes() in a few places
Categorical: Fix assigning a Categorical to an existing string column
CLN: CategoricalDtype repr now yields category
DISPLAY: show dtype when displaying Categorical series (for consistency)
BUG: fix groupby with multiple non-compressed categoricals
Categorical: minor doc cleanups
ENH: add a metaclass to CategoricalDtype to provide issubclass support (for select_dtypes)
TST: io/pytables.py tests now raise NotImplementedError for dtype==category
DOC: document the new category dtype in select_dtypes

jreback · 2014-07-14T21:29:26Z

@JanSchulz I just rebased this on current master. I think this is ready for merging. I am sure we will have to do a follow-up for doc fixes / clarifications. But merging makes sense sooner rather than later (so it can be beat up a bit in master).

ok?

jankatins · 2014-07-14T21:35:37Z

Yep, I will look out for categorical bugs and try to handle them.

jreback · 2014-07-14T21:42:46Z

@JanSchulz thanks for this
nice enhancement!

jreback · 2014-07-14T22:15:34Z

add to the list: link from the whatsnew categorical changes section to the docs is broken

jreback · 2014-07-14T22:20:15Z

might want to move this to cat section

http://pandas-docs.github.io/pandas-docs-travis/reshaping.html#computing-indicator-dummy-variables (and/or provide a link)

(maybe not the get_dummies but the factorization section)

jreback · 2014-07-14T22:29:56Z

I think maybe move categorical to right after reshaping?
(maybe call it Categorical Data?)

jankatins · 2014-07-16T16:48:13Z

Another thing: change the ordered default in from_codes to ordered=False -> Usually the logic has some "hints" when the constructor sets ordered=True (i.e. the values are sortable), but in the from_codes case we don't. I noticed this when I looked at the docs, read the from_codes example, and found it very strange that in that "Test-Train" case the levels were ordered.

jreback · 2014-07-16T17:01:50Z

sure... go ahead and do a new PR for these items... (don't add to the old one).

jankatins · 2014-07-17T12:56:57Z

@jreback re Docs and factorization: IMO, this should stay there and only gain a new paragraph that links to the categorical docs, plus an example of how to get the same information from a categorical. Factorization probably has its uses without using a full Categorical?

Idea for the text:

Note: if you just want to handle one column as a categorical variable (R's factor), you can use df["cat_col"] = Categorical(df["col"]). See the categorical_documentation for more information. This feature was introduced in version 0.15.

jreback · 2014-07-17T13:01:20Z

ok that's fine, though definitely links back-forth would be good (e.g. use of cut from the Categorical section).

Also prob need to add entries/links to 10min.rst and cookbook

jankatins · 2014-07-23T22:32:33Z

This is continued in #7768. I added the links from and to other places in the docs, so everything here should be addressed in #7768.

jreback added Enhancement labels May 23, 2014

jreback added this to the 0.14.1 milestone May 23, 2014

jreback changed the title ~~WIP: categoricals as a ninternal CategoricalBlock GH5313~~ WIP: categoricals as an internal CategoricalBlock GH5313 May 23, 2014

jreback mentioned this pull request May 23, 2014

ENH: Add support for Categoricals in BlockManager #5313

Closed

jreback reviewed May 23, 2014

shoyer mentioned this pull request May 28, 2014

DOC: Clarify 'public-ish' API for packages using pandas. #5460

Closed

This was referenced Jun 3, 2014

How should users reorder the x axis bins in a bar chart? yhat/ggpy#315

Closed

facets with descrete values (e.g. geom_bar) does not work yhat/ggpy#196

Closed

jankatins mentioned this pull request Jun 4, 2014

catch/handle pandas.Categoricals statsmodels/statsmodels#1148

Closed

jankatins mentioned this pull request Jul 9, 2014

ENH: select_dypes impl #7434

Merged

jreback pushed a commit to jreback/pandas that referenced this pull request Jul 9, 2014

Categorical: Thanks for Jan Schulz for much of the work on Categoricals

b8972bb

Doc: Add Release notes for pandas-dev#7217

jreback and others added 2 commits July 14, 2014 17:16

jreback added a commit that referenced this pull request Jul 14, 2014

Merge pull request #7217 from jreback/categorical

570584c

WIP: categoricals as an internal CategoricalBlock GH5313

jreback merged commit 570584c into pandas-dev:master Jul 14, 2014

armaganthis3 mentioned this pull request Jul 16, 2014

GH6848 silently changed series.sort from stable to unstable sort #7750

Closed

jankatins mentioned this pull request Jul 23, 2014

Categorical fixups #7768

Closed

5 tasks

jorisvandenbossche mentioned this pull request Aug 19, 2014

Discussion: feedback on the Categorical integration #8074

Closed

jankatins mentioned this pull request Sep 19, 2014

Use geo accessor for GeoSeries methods? geopandas/geopandas#166

Open

jreback mentioned this pull request Feb 10, 2017

ENH: Intervalindex #15309

Closed
