WIP: categoricals as an internal CategoricalBlock GH5313 #7217

Merged
merged 2 commits into from
Jul 14, 2014

Contributor

@jreback jreback commented May 23, 2014

This PR creates a CategoricalBlock as a first-class internal object, on par with blocks like Datetime, Timedelta, Object, Numeric, etc.

closes #5313
closes #5314
closes #3943

TODOS

Code Changes

  • factor_agg/group_agg in core/frame.py -> look at unittests // a google search didn't turn up any questions/usage -> remove them?
  • Add an API change note about group_agg and factor_agg
  • Add an API change note about Categorical.labels -> Categorical.codes
  • Printing level information when a Series of type categorical is printed?
  • Fix the remaining "FIXME" in tests
    • Groupby -> include each level, even if group is empty
    • Pivot table -> include each level, even if group is empty
    • Series(categorical).describe() / Categorical.unique() -> should this return all levels or only used levels?
    • Series(cat).describe() -> show information about the levels?
    • sort by index?
    • df.to_csv: fails due to a slicer error?
    • concat/append -> should retain categoricals
    • concat/append -> should raise on different levels
    • sorting a dataframe by a categorical variable does not use the level ordering
    • sorting by an unsortable categorical should not be possible and should raise
    • reorder_levels: raise if set(old_levels) != set(new_levels)?
    • min/max and numeric_only=True
    • df.to_hdf: fails due to categorical.T not implemented
    • Category.describe() with empty levels (will be fixed with groupby)
  • TST: apply -> look into if/how it works -> should probably convert first and not try to preserve the categorical
  • TST: def test_sort_dataframe(self): -> sort df, cats must also be sorted!
  • TST: Add tests about some more numpy function that should fail
  • labels vs. pointers: should the (internal) name of the integer array be renamed? -> I would like to see it renamed to something with an underscore, ideally to another name (_pointers, _level_idx, _level_pointer)
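
The two sorting items above (use the level ordering; raise for an unsortable categorical) can be illustrated with a small pure-Python sketch. `sort_by_levels` and its `ordered` flag are hypothetical names chosen for illustration, not pandas API.

```python
def sort_by_levels(values, levels, ordered=True):
    """Sort values by their position in `levels`, not by natural order."""
    if not ordered:
        # An unordered categorical has no meaningful sort order.
        raise TypeError("cannot sort an unordered categorical")
    position = {level: i for i, level in enumerate(levels)}
    return sorted(values, key=position.__getitem__)


sizes = ["medium", "low", "high", "low"]
print(sort_by_levels(sizes, levels=["low", "medium", "high"]))
# ['low', 'low', 'medium', 'high']
```

A plain lexical sort would put "high" first, which is exactly the behaviour the TODO item wants to avoid.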

Documentation

  • Add release notes
  • Add an API change note about the old constructor mode
  • Document the Categorical methods
  • Link classes/methods in categorical.rst -> not done, just a link to API docs and in special cases.
  • Add Categorical API to API docs

Future

  • HDF support? -> see above... ENH: Categorical serialized #7621
  • meta infos / "Contrasts": R's factor has a "contrast" property which indicates how this factor should be used in statistical models -> don't do it in the PR

@jreback jreback added this to the 0.14.1 milestone May 23, 2014
@jreback jreback changed the title WIP: categoricals as a ninternal CategoricalBlock GH5313 WIP: categoricals as an internal CategoricalBlock GH5313 May 23, 2014
@jreback
Contributor Author

jreback commented May 23, 2014

cc @immerrr

if isinstance(other, compat.string_types):
    return other == self.name
else:
    return other == self.base
Contributor Author

cc @immerrr if you have any ideas on how to make this more 'real', lmk. (e.g. this simply is not a sub-class of np.dtype, nor do I think you can actually make one).
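
The comparison above lets a plain Python object stand in for a dtype by comparing equal to its string name. A minimal sketch of that idea, assuming Python 3 (`str` instead of `compat.string_types`) and leaving out the `base` fallback; `CategoryDtypeSketch` is a hypothetical class, not the one in this PR:

```python
class CategoryDtypeSketch:
    """Dtype-like object that is not a subclass of np.dtype."""

    name = "category"

    def __eq__(self, other):
        # Allow comparisons such as `dtype == "category"`.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, CategoryDtypeSketch)

    def __hash__(self):
        # Defining __eq__ removes the default hash, so restore one.
        return hash(self.name)

    def __repr__(self):
        return self.name


dtype = CategoryDtypeSketch()
print(dtype == "category")  # True
print(dtype == "object")    # False
```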

Contributor

Well, you can, but it's not trivial; see quaternion, which is a "canonical" example of a dtype subclass.

Contributor

On second thought, quaternion is not a subclass but rather a new class, so you may be right about subclassing.

Contributor Author

yeh...that looked complicated. and I really need the sub-class; what I have does work (it's almost more for display than anything)

Contributor

There are a lot of bells and whistles in that example, that's true. A minimal viable new dtype can be found here.

@jreback
Contributor Author

jreback commented May 24, 2014

@jseabold feel free to jump in here for commentary

@jankatins
Contributor

@jreback See https://github.com/JanSchulz/pandas/tree/categorical_improvements

  • new way to build a Categorical: Categorical(["a","b"], levels=["b","a"])
  • "ordered" keyword
    • implement sort/min/max in Categorical
    • sort in Series(Categorical(...))

What doesn't work yet:

  • min/max doesn't yet work on a Series(Categorical(...)) -> not sure yet what I have to implement to get this working
  • labels
  • assigning new levels (needs some reordering, ...)
  • performance improvements?

@jreback
Contributor Author

jreback commented May 25, 2014

ok will have a look

why did you change labels -> _values_pointer
that is not common nomenclature afaict

on the surface this seems to break back compat is that true?

@jankatins
Contributor

The problem with the "old" labels was that they were "pointers" and labels in R have a different meaning:

R: labels are other names for levels
old pd.Categorical: labels are pointers to levels

pseudo code for future label usage, which is IMO compatible with R

a = Categorical([1,2,3,1], levels=[1,2,3])
a.labels = ["x","y","z"]
a.__array__() == ["x","y","z","x"]
a.labels = ["low","medium","high"]
a.__array__() == ["low","medium","high","low"]
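
To make the three terms concrete, here is a pure-Python sketch of the proposal: values are stored as integer pointers into `levels`, and `labels` are only display names for those levels. `ToyCategorical` and its `to_list` method are hypothetical and follow the naming of the pseudo code, not any released pandas API.

```python
class ToyCategorical:
    def __init__(self, values, levels):
        self.levels = list(levels)
        # Integer position of each value within `levels`.
        self.pointers = [self.levels.index(v) for v in values]
        # By default every level is displayed under its own name.
        self.labels = list(levels)

    def to_list(self):
        # Decode the pointers through the display labels.
        return [self.labels[p] for p in self.pointers]


a = ToyCategorical([1, 2, 3, 1], levels=[1, 2, 3])
a.labels = ["low", "medium", "high"]
print(a.pointers)   # [0, 1, 2, 0]
print(a.to_list())  # ['low', 'medium', 'high', 'low']
```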

@jankatins
Contributor

One question: how does one get to the underlying data structure from Series?

s = Series(Categorical(...)) -> how do I get the levels/labels from s

That would be the most interesting feature from ggplot's side: we want to get the levels/labels from the dataframe which is passed to the plotting functions

@jreback
Contributor Author

jreback commented May 25, 2014

ok so labels are descriptors of the levels (if provided)
and value_pointers are what we call labels now

hmm

what if we call values_pointers:

locs or index?

@jreback
Contributor Author

jreback commented May 25, 2014

Series(Categorical(....)).values returns the Categorical itself

@jreback
Contributor Author

jreback commented May 25, 2014

however we could define labels/levels as a property on Series that only works on categorical series (I think), or a method if u need to pass in arguments

@jankatins
Contributor

I don't mind the names, I just think the "pointers" are an internal implementation detail and shouldn't show up in tab completion.

@jankatins
Contributor

I'm off to bed, will be back tomorrow evening...

@jreback
Contributor Author

jreback commented Jun 3, 2014

@JanSchulz do you have any idea what the purpose of factor_agg/group_agg in core/frame.py is? I don't see them used anywhere but in a couple of tests. Maybe they existed before the groupby code handled Categoricals (which is what this looks like it's doing). Or am I missing something?

@jreback
Contributor Author

jreback commented Jun 3, 2014

@JanSchulz

ok I have a new version which incorporates your changes

  • delegates reduction ops from Series to the Categorical itself (so these end up calling the min/max you defined)
  • other reduction ops (var, mean, sum, etc.) raise TypeError
  • numeric ops raise TypeError (e.g. cat1 + cat2)
  • sorting was off a bit; it now works on copies: sort is inplace and order returns a copy (just like Series)
  • I changed _value_pointer back to labels for consistency with the rest of the library. I know you don't like it, but it's 'internal' anyhow.
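
A rough sketch of the delegation described in this list: order-based reductions are answered from the level order, while numeric reductions and arithmetic raise TypeError. `OrderedCat` is a hypothetical stand-in that only mirrors the behaviour described here; its `codes` attribute plays the role of the `labels` mentioned above.

```python
class OrderedCat:
    def __init__(self, values, levels):
        self.levels = list(levels)
        # Integer position of each value within `levels`.
        self.codes = [self.levels.index(v) for v in values]

    def min(self):
        # The smallest code refers to the earliest level that is present.
        return self.levels[min(self.codes)]

    def max(self):
        return self.levels[max(self.codes)]

    def sum(self):
        raise TypeError("sum is not defined for categorical data")

    def __add__(self, other):
        raise TypeError("numeric operations are not defined for categorical data")


cat = OrderedCat(["b", "a", "c", "a"], levels=["c", "b", "a"])
print(cat.min(), cat.max())  # c a
```

Because min and max are computed on the integer codes, they follow the level order ("c" < "b" < "a" here) rather than the alphabetical order of the values.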

what's still missing?

@jankatins
Contributor

After a bit more poking at R's factors, I think labels there are kind of a waste of time: I think changing levels in R will do renaming/reducing/appending of levels. It seems labels are simply there to change the names of the levels in one go during creation of a factor, but that can also be done afterwards by changing the levels. So I'm fine with that change back from pointers to labels, although I still find it strange, as from my (German, so non-English) POV labels are "names for something", so actually I would have expected the other end to be the labels ("pointers point to labels").

In R, a categorical is always strings (so the first thing would be to convert values to strings), but I think we can skip that.

Anyway, I think the main part missing is an API which reorders/reduces/expands levels based on an input array/list, plus renaming of levels:

c = Series(Categorical([1,2,3,4,1], levels=[1,2,3,4]))
# all following examples would operate on this one...
c.labels == [0,1,2,3,0]
c.levels == [1,2,3,4]
c.get_values() == [1,2,3,4,1] 
# reorder would basically replace the levels and do a replace on the pointers/labels so that they point to the same 'levels' (or in my speak 'labels')
c.reorder([4,3,2,1]) # all "pointers" to '4' must be changed from 3 to 0,...
c.labels == [3,2,1,0,3] # positions are changed
c.levels == [4,3,2,1] # levels are now in new order
c.get_values() == [1,2,3,4,1] # output is the same
c.min() == 4
c.max() == 1
# assigning to levels would simply exchange the levels array
c.levels = [4,3,2,1] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2,1] 
c.get_values() == [4,3,2,1,4] 
c.min() == 4
c.max() == 1
# assigning a longer array will add a level 
c.levels = [4,3,2,1,0] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2,1,0] 
c.get_values() == [4,3,2,1,4] 
c.min() == 4
c.max() == 1 # level 0 is unused, so the largest value present is still 1
c.levels = [4,3,"a",2,1] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,"a",2,1] 
c.get_values() == [4,3,"a",2,4] 
# assigning a shorter array will make those values NA
c.levels = [4,3,2] 
c.labels == [0,1,2,NA,0]
c.levels == [4,3,2] 
c.get_values() == [4,3,2,NA,4] 
# assigning a NA level could do two things: make that value NA or the level would be NA
# I would vote for the latter, as otherwise we would need to check each new level for NAs...
c.levels = [4,3,2, NA] 
c.labels == [0,1,2,3,0]
c.levels == [4,3,2, NA] 
c.get_values() == [4,3,2,NA,4] 
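
The difference between reordering and assigning levels can be checked with a pure-Python sketch. Both helper functions are hypothetical, `None` stands in for NA, and the ValueError on a changed set of levels is an assumption rather than settled behaviour.

```python
def reorder(labels, levels, new_levels):
    """Change the level order; the decoded values stay the same."""
    if set(new_levels) != set(levels):
        raise ValueError("reordering must keep the same set of levels")
    new_labels = [new_levels.index(levels[i]) for i in labels]
    return new_labels, list(new_levels)


def decode(labels, levels):
    """Map integer labels to level values; out-of-range becomes None."""
    return [levels[i] if 0 <= i < len(levels) else None for i in labels]


labels, levels = [0, 1, 2, 3, 0], [1, 2, 3, 4]

# Reordering remaps the labels, so the values are unchanged.
r_labels, r_levels = reorder(labels, levels, [4, 3, 2, 1])
print(r_labels)                    # [3, 2, 1, 0, 3]
print(decode(r_labels, r_levels))  # [1, 2, 3, 4, 1]

# Assigning keeps the labels, so the values are renamed.
print(decode(labels, [4, 3, 2, 1]))  # [4, 3, 2, 1, 4]
print(decode(labels, [4, 3, 2]))     # [4, 3, 2, None, 4]
```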

I also think that operations based on categorical levels (groupby, etc.) should return groups for all levels and not only the ones that have values -> empty groups for levels without values.
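
A sketch of what "groups for all levels" would mean, using a hypothetical `count_by_level` helper rather than pandas' groupby:

```python
def count_by_level(values, levels):
    # Start every level at zero so empty groups still appear.
    counts = {level: 0 for level in levels}
    for value in values:
        counts[value] += 1
    return counts


print(count_by_level(["a", "c", "a"], levels=["a", "b", "c"]))
# {'a': 2, 'b': 0, 'c': 1}
```

Level "b" has no values but still shows up with a count of zero, which is what faceted plots need.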

@jorisvandenbossche
Member

Note: I am not really following this, or a user of categorical types at the moment.

But, seeing this PR and the discussion, and as this is a really big and important new feature and addition to the pandas 'language' (not just a new method/function), I was wondering if it would be better to first lay out the design (API, naming, ...) in a kind of 'design document' (something like a PEP for pandas)?

Just from a brief look, it seems there is still some discussion about the naming of things, how certain operations should work exactly, ... This is also something that should be discussed more broadly I think (with more people who would use this, send it out to the list), but it is now difficult to engage in the discussion with only the code to look at and the discussion in this PR.

Having an overview with the reasoning behind this new type, the naming, the API, how you create it, how to do common operations on it, how it behaves in other pandas methods, some examples of applications, (maybe a short comparison with R), ... could be beneficial for making this a more solid enhancement, but also to facilitate the discussion (and afterwards this can be used as a start for some documentation, so certainly not 'wasted' in that regard).

Just a remark from the sideline. What do you think?
@JanSchulz Would you be able to start something like that?

@jankatins
Contributor

I think someone from the statistics side should comment on this: cc @jseabold @josef-pkt and @cancan101 @kshedden @upandacross @cfarmer because they had some issues open which showed up after a search for "categorical" in Statsmodels

I come from ggplot and plotting and I'm actually after the "reorder bar charts" (-> reorder levels) and "make faceting easier by letting empty levels show up in groupby" features (see linked issues above).

I can do a comparison with R: creation, adding to a df, reorder levels, change levels, groupby. Something else?

@jankatins
Contributor

What also needs to be tested is "add to a df, sort the df on another column, see that the categorical series is changed accordingly". Also selection: df[df["cat_col"] == 4] should select the right rows

@jankatins
Contributor

@jreback Thanks!

I will try to add a few tests and see if everything works by tomorrow afternoon. It might happen that I don't get that far and I will be traveling until Thursday.

@jreback
Contributor Author

jreback commented Jul 6, 2014

np

this is not going to be merged until next week anyhow (0.14.1 should be released on Friday), so after that

@jreback
Contributor Author

jreback commented Jul 7, 2014

@JanSchulz updated with a test for using cats and non-cats. it ends up expanding the output space to basically be non-compressed again, but will see if that is an issue.

@jankatins jankatins mentioned this pull request Jul 9, 2014
jreback pushed a commit to jreback/pandas that referenced this pull request Jul 9, 2014
@jankatins
Contributor

@jreback please cherry-pick jankatins@b96cf3c (documentation updates in basics.rst for the new select_dtypes method)

@jreback
Contributor Author

jreback commented Jul 10, 2014

@JanSchulz done

jreback and others added 2 commits July 14, 2014 17:16
     GH3943, GH5313, GH5314, GH7444

ENH: delegate _reduction and ops from Series to the categorical
     to support min/max and raise TypeError on other ops (numerical) and reduction

Add Categorical Properties to Series

Default to 'ordered' Categoricals if values are ordered

Categorical: add level assignments and reordering + changed default for ordered

Add a `Categorical.reorder_levels()` method. Change some naming in `Series`,
so that the methods do not clash with established standards and rename the
other categorical methods accordingly.

Also change the default for `ordered` to True if values + levels are passed
in at creation time.

Initial doc version for working with Categorical data

Categorical: add Categorical.mode() and use that in Series.mode()

Categorical: implement remove_unused_levels()

Categorical: implement value_count() for categorical series

Categorical: make Series.astype("category") work

ENH: add setitem to Categorical

BUG: assigning to levels not in level set now raises ValueError

API: disallow numpy ufuncs with categoricals

Categorical: Categorical assignment to int/obj column

ENH: add support for fillna to Categoricals

API: deprecate old style categorical constructor usage and change default

Before it was possible to pass in precomputed labels/pointer and the
corresponding levels (e.g.: `Categorical([0,1,2], levels=["a","b","c"])`).

This could lead to subtle errors in case of integer categoricals: the
following could be both interpreted as "precomputed pointers and
levels" or "values and levels", but converting it back to an integer
array would result in different arrays:

`np.array(Categorical([1,2], levels=[1,2,3]))`
interpreted as pointers: `[2,3]`
interpreted as values: `[1,2]`

Up to now we would favour old style "pointer and levels" if these
values could be interpreted as such (see code for details...). With
this commit we favour new style "values and levels" and only attempt
to interpret them as "pointers and levels" if "compat=True" is passed
to the constructor.

BREAKS: This will break code which uses Categoricals with "pointer and
levels". A short google search and a search on stackoverflow revealed
no such usage.
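
The ambiguity described in this commit message is easy to reproduce without pandas. The two helpers below are hypothetical and only model the two readings of the same constructor arguments; treating a value outside the levels as missing (`None`) is an assumption of this sketch.

```python
def as_pointers(data, levels):
    # Old style: each entry is a position in `levels`.
    return [levels[i] for i in data]


def as_values(data, levels):
    # New style: each entry is already a value; unknown values become None.
    return [value if value in levels else None for value in data]


data, levels = [1, 2], [1, 2, 3]
print(as_pointers(data, levels))  # [2, 3]
print(as_values(data, levels))    # [1, 2]
```

The same inputs decode to two different integer arrays, so the constructor has to pick one reading by default.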

Categorical: document constructor changes and small fixes

Categorical: document that inappropriate numpy functions won't work anymore

ENH: concat support
Doc: Add Release notes for pandas-dev#7217

DOC: update v0.15.0 notes

Categorical: .codes should be immutable

ERR: codes modification raises ValueError always

Categorical: use Categorical.from_codes() in a few places

Categorical: Fix assigning a Categorical to an existing string column

CLN: CategoricalDtype repr now yields category
DISPLAY: show dtype when displaying Categorical series (for consistency)

BUG: fix groupby with multiple non-compressed categoricals

Categorical: minor doc cleanups

ENH: add a metaclass to CategoricalDtype to provide issubclass support (for select_dtypes)

TST: io/pytables.py tests now raise NotImplementedError for dtype==category

DOC: document the new category dtype in select_dtypes
@jreback
Contributor Author

jreback commented Jul 14, 2014

@JanSchulz I just rebased this on current master. I think this is ready for merging. I am sure we will have to do a follow-up for doc fixes / clarifications. But merging makes sense sooner rather than later (so it can be beat up a bit in master).

ok?

@jankatins
Contributor

Yep, I will look out for categorical bugs and try to handle them.

jreback added a commit that referenced this pull request Jul 14, 2014
WIP: categoricals as an internal CategoricalBlock GH5313
@jreback jreback merged commit 570584c into pandas-dev:master Jul 14, 2014
@jreback
Contributor Author

jreback commented Jul 14, 2014

@JanSchulz thanks for this
nice enhancement!

@jreback
Contributor Author

jreback commented Jul 14, 2014

add to the list: link from the whatsnew categorical changes section to the docs is broken

@jreback
Contributor Author

jreback commented Jul 14, 2014

might want to move this to cat section

http://pandas-docs.github.io/pandas-docs-travis/reshaping.html#computing-indicator-dummy-variables (and/or provide a link)

(maybe not the get_dummies but the factorization section)

@jreback
Contributor Author

jreback commented Jul 14, 2014

I think maybe move categorical to right after reshaping?
(maybe call it Categorical Data?)

@jankatins
Contributor

Another thing: change the ordered default in from_codes to ordered=False -> Usually the logic has some "hints" when the constructor sets ordered=True (i.e. the values are sortable), but in the from_codes case we don't. I noticed this when I read the from_codes example in the docs and found it very strange that the levels were ordered in that "Test-Train" case.

@jreback
Contributor Author

jreback commented Jul 16, 2014

sure....go ahead and do a new PR for these items......(don't add to the old one).

@jankatins
Contributor

@jreback re Docs and factorization: IMO, this should stay there and only gain a new paragraph linking to the categorical docs, plus an example of how to get the same information from a categorical. Factorization probably has its uses without using a full Categorical?

Idea for the text:

Note: if you just want to handle one column as a categorical variable (R's factor), you can use df["cat_col"] = Categorical(df["col"]). See the categorical_documentation for more information. This feature was introduced in version 0.15.

@jreback
Contributor Author

jreback commented Jul 17, 2014

ok that's fine, though definitely links back-forth would be good (e.g. use of cut from the Categorical section).

Also prob need to add entries/links to 10min.rst and cookbook

@jankatins
Contributor

This is continued in #7768. I added the links from and to other places in the docs, so everything here should be addressed in #7768

Labels
Categorical Categorical Data Type Enhancement
10 participants