Feature Request: categorical.reset_order #9190

Closed
jseabold opened this issue Jan 2, 2015 · 12 comments · Fixed by #9622
Labels
Categorical Categorical Data Type Docs

@jseabold
Contributor

jseabold commented Jan 2, 2015

I was thinking about trying to add a to_unordered method, then I thought maybe a reset_order (or reorder?) with an optional drop keyword à la reset_index makes more sense. I haven't checked whether this is already possible, so this is also a question: is there some other syntactic sugar for it? I might see if I can hack this together at some point unless someone beats me to it. My motivation is that read_stata gives me either all ordered or all unordered factors. This could be "fixed" there by accepting a list in addition to the boolean that controls conversion to ordered, but I think a method like this would be generally useful, plus I never peek at data before reading it.

@jseabold
Contributor Author

jseabold commented Jan 5, 2015

Thanks, yeah, that's helpful. AFAICT, it still doesn't look like there's a way to just drop the ordering though, correct?

Re: read_stata, yeah I saw that, but it's an all or nothing proposition. No way to pass a list. Would be an easy fix. I'll look at it.
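A minimal sketch of the per-column workaround described here, assuming a recent pandas release where the `.cat.as_unordered()` accessor method exists. The helper name `drop_order` and the sample frame are hypothetical, standing in for the output of a read_stata call that made every categorical ordered:

```python
import pandas as pd


def drop_order(df, columns):
    """Return a copy of df with the listed categorical columns made unordered.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    for col in columns:
        # as_unordered() returns a new Series; the category list is kept as-is
        out[col] = out[col].cat.as_unordered()
    return out


# Stand-in for a frame in which every labeled variable came back ordered
df = pd.DataFrame(
    {
        "region": pd.Categorical(
            ["north", "south", "north"],
            categories=["north", "south"],
            ordered=True,
        ),
        "rating": pd.Categorical(
            ["low", "high", "mid"],
            categories=["low", "mid", "high"],
            ordered=True,
        ),
    }
)

# Only "region" loses its ordering; "rating" stays ordered
result = drop_order(df, ["region"])
```

Working on a copy keeps the original frame untouched, so the ordering information is still available if it turns out to be needed.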

Unrelated, I'm sure it was discussed ad nauseam but I was also surprised that ordered is the default for Categorical. In my experience, unordered is more common.

@jreback
Contributor

jreback commented Jan 5, 2015

In [1]: import pandas as pd                                    

In [2]: s = pd.Series(list('aabcd'),dtype='category')          

In [3]: s                                                      
Out[3]:                                                        
0    a                                                         
1    a                                                         
2    b                                                         
3    c                                                         
4    d                                                         
dtype: category                                                
Categories (4, object): [a < b < c < d]                        

In [4]: s.cat.ordered                                          
Out[4]: True                                                   

In [5]: s.cat.ordered = False                                  

In [6]: s                                                      
Out[6]:                                                        
0    a                                                         
1    a                                                         
2    b                                                         
3    c                                                         
4    d                                                         
dtype: category                                                
Categories (4, object): [a, b, c, d]                           

Just set the ordered flag to False to drop the ordering.
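For readers on a recent pandas release, the same effect can be sketched with the `.cat.as_unordered()` accessor method instead of assigning to the flag. This is an illustration under that assumption, not the session above; the categorical is built with an explicit `ordered=True` so the starting state does not depend on any default:

```python
import pandas as pd

# Build an explicitly ordered categorical Series
s = pd.Series(pd.Categorical(list("aabcd"), ordered=True))

# Drop the ordering; values and the category list are unchanged
unordered = s.cat.as_unordered()
```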

cc @JanSchulz .... do you recall the exact discussion w.r.t. ordered being True by default?

@jorisvandenbossche
Member

@jreback Your example above of resetting the order could be added to the docs. They cover setting the order and sorting, but I did not find anything on dropping it.

@jorisvandenbossche jorisvandenbossche added Docs Categorical Categorical Data Type labels Jan 5, 2015
@jseabold
Contributor Author

jseabold commented Jan 5, 2015

Oh, nice. I still think a method for changing the state of the object would be nice too. Methods are more or less self-documenting.

@jreback jreback added this to the 0.16.0 milestone Jan 5, 2015
@bashtage
Contributor

bashtage commented Jan 6, 2015

The default of ordered=True was chosen because Stata only stores numerical data, so it is always possible to order according to the numeric values, and because it is trivial to drop the ordering if needed but non-trivial to re-assign it if the data is read in as unordered.

Oops, this was only w.r.t. read_stata, not categorical creation, if that is the nature of the above question.

@jankatins
Contributor

In the discussion, we wanted an ordered categorical whenever the underlying data had an order, which is true most of the time (ints, strings, ... are all orderable). So ordered actually defaults to False, but that default is rarely used:

    def __init__(self, values, categories=None, ordered=None, name=None, fastpath=False,
                 levels=None):
    [...]
    # case without explicit categories
                # If the underlying data structure was sortable, and the user doesn't want to
                # "forget" this order, the categorical also is sorted/ordered
                if ordered is None:
                    ordered = True
    # case with explicit categories
            # if we got categories, we can assume that the order is intended
            # if ordered is unspecified
            if ordered is None:
                ordered = True
    [...]
    self.ordered = False if ordered is None else ordered

Regarding a drop_ordering (or remove_ordering?): I'm not sure what the expected result would be:

  • just the same as ordered=False or
  • also remove any ordering of the categories and 'resort' to the default order, which is defined on the individual elements (e.g. categories.sort())?
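The two readings above can be contrasted in a short sketch, assuming a recent pandas release with the `.cat.as_unordered()` and `.cat.reorder_categories()` accessor methods. The variable names and sample data are illustrative only:

```python
import pandas as pd

s = pd.Series(
    pd.Categorical(
        ["lo", "hi", "mid"],
        categories=["lo", "mid", "hi"],
        ordered=True,
    )
)

# Reading 1: only clear the flag; the category list keeps its sequence
option_a = s.cat.as_unordered()

# Reading 2: clear the flag and also fall back to the element-wise sort order
option_b = option_a.cat.reorder_categories(sorted(option_a.cat.categories))
```

Under the first reading the category list stays `lo, mid, hi`, so the ordering can be switched back on without restating it. Under the second the list becomes the alphabetical `hi, lo, mid` and the original sequence is gone.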

@jseabold
Contributor Author

jseabold commented Jan 7, 2015

@bashtage Yeah, I just meant with Categorical in general.

@JanSchulz Hmm. I think the rationale for the default should be based on what's more common in the real world. My prior is that unordered factors are much more common. R defaults to unordered unless ordered=TRUE. Do people complain about this? Seems sane to me.

Re: drop ordering, it would just be the same as ordered=False. If the defaults don't change, I suspect people are going to be calling this a lot.
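As a point of reference for the default being debated, here is a small check that can be run on a recent pandas release, where the `Categorical` constructor leaves the result unordered unless `ordered=True` is passed. The behaviour at the time of this thread was different, as described above:

```python
import pandas as pd

values = ["b", "a", "c", "a"]

# No ordered argument: unordered on recent pandas releases
default_cat = pd.Categorical(values)

# Ordering has to be requested explicitly
ordered_cat = pd.Categorical(values, ordered=True)
```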

@jankatins
Contributor

@jreback, @jseabold: I've no real preference on that default (i.e. I understand both rationales), but if that should change we should do it as early as possible as that's an API change...

@jankatins
Contributor

If Stata has ordered == False, read_stata should build categoricals with a different default.

@bashtage
Contributor

Stata's datafile format does not explicitly allow a determination of whether a labeled variable is ordered or not; only the end user has this information. The primary reasons to import as an ordered categorical are that

  • Order can be trivially removed. As an aside, I'm not sure I even understand the issue of having an unordered categorical stored as an ordered one, aside from mental bookkeeping by the end user.
  • The ordinal information contained in the Stata data, if useful, is lost if imported as an unordered categorical.
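One practical difference between the two states, sketched under the assumption of a recent pandas release: inequality comparisons and `max()` work on an ordered categorical, while an unordered one rejects them with a `TypeError`. The data and variable names are illustrative:

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# Ordered: inequality comparisons and max() follow the category order
bigger_than_small = sizes > "small"
largest = sizes.max()

# Unordered: the same comparison is refused
unordered = sizes.cat.as_unordered()
try:
    unordered > "small"
    comparison_allowed = True
except TypeError:
    comparison_allowed = False

# The flag can be switched back on; this only recovers the intended order
# because the category list was left in its original sequence
restored = unordered.cat.as_ordered()
```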

@jankatins
Contributor

This topic is now also in #9347: should s.cat.ordered be settable or read-only? If the latter, then an explicit as_unordered() or as_ordered() (or some other method) makes sense.

5 participants