Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854

Closed
wants to merge 1 commit into from

Conversation

pijucha
Copy link
Contributor

@pijucha pijucha commented Jul 31, 2016

This commit modifies MultiIndex.from_array and MultiIndex.from_product.

Example:

cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
mi = pd.MultiIndex.from_arrays([cat, cat])

mi.levels[0]
Out[55]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

mi.get_level_values(0)
Out[56]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

Previously, the results were:

mi.levels[0]
Out[345]: Index(['b', 'a', 'c'], dtype='object')

mi.get_level_values(0)
Out[346]: Index(['a', 'b'], dtype='object')

This modification makes groupby, pivot, and set_index preserve categorical types in indexes.

@codecov-io
Copy link

codecov-io commented Jul 31, 2016

Current coverage is 85.27% (diff: 100%)

Merging #13854 into master will decrease coverage by <.01%

@@             master     #13854   diff @@
==========================================
  Files           139        139          
  Lines         50555      50561     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43111      43116     +5   
- Misses         7444       7445     +1   
  Partials          0          0          

Powered by Codecov. Last update ccec504...99e4a52

@@ -855,3 +855,4 @@ Bug Fixes

- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just say MultiIndex constructor

@pijucha
Copy link
Contributor Author

pijucha commented Aug 2, 2016

Other places where a categorical dtype is lost in similar circumstances.

cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)
  1. concat

    df = pd.DataFrame([[10, 11, 12]])
    
    pd.concat([df, df], keys=cidx).index.levels[0]
    Out[32]: Index(['y', 'x'], dtype='object')
  2. stack with a non-unique index/multi-index:

    df = pd.DataFrame([[10, 11, 12]], columns=cidx_nonunique)
    
    df.stack().index.levels[1]
    Out[35]: Index(['x', 'y', 'z'], dtype='object')
  3. get_dummies

    pd.get_dummies(cidx).columns
    Out[36]: Index(['x', 'y', 'z'], dtype='object')
  4. make_axis_dummies with transform

    df = pd.DataFrame([[10, 11]], columns=cidx)
    ldf = pd.Panel({'A': df, 'B': df}).to_frame()
    
    pd.core.reshape.make_axis_dummies(panel.to_frame(), transform=lambda x: x).columns
    Out[53]: Index(['x', 'y', 'z'], dtype='object')
  5. panel_index

    pi = pd.core.panel.panel_index([0, 1, 2], cidx)
    pi.levels[1]
    Out[57]: Index(['x', 'y'], dtype='object', name='panel')
  6. pytables: LegacyTable.read()
    No quick example yet.

@jreback
Copy link
Contributor

jreback commented Aug 2, 2016

@pijucha 5 & 6 you can ignore

@jreback
Copy link
Contributor

jreback commented Aug 2, 2016

@pijucha this should exist as a separate function from the Categorical constructor as a private function (but you can put in pandas.core.categorical) is prob not a bad location. maybe _create_categoricals_from_arrays? (or similar). E.g. its a 'categorical' function, but returns labels/levels (and not exactly a cat).

can certainly merge this fix and then do a followup with a more reaching name change. lmk.

if is_categorical(values):
if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
values = values._values
categories = CategoricalIndex(values.categories,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a CI and not just a Categorial?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency. The else part returns cat.categories, which is an Index.

@pijucha
Copy link
Contributor Author

pijucha commented Aug 2, 2016

@jreback OK, Thanks.

@pijucha
Copy link
Contributor Author

pijucha commented Aug 17, 2016

Sorry for a bit of delay. I fixed stack, get_dummies, make_axis_dummies (2-4 in the list above) and opened a separate issue for concat (and for two other issues I came across).

@@ -1607,6 +1607,113 @@ def test_map(self):
result = c.map(lambda x: 1)
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))

def test_groupby_preserve_dtype(self):
Copy link
Contributor

@jreback jreback Aug 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to test_groupby (for the groupby tests)

@jreback
Copy link
Contributor

jreback commented Aug 17, 2016

small changes. looks pretty good.

@pijucha
Copy link
Contributor Author

pijucha commented Aug 22, 2016

I moved some tests to files I thought were more appropriate (though, I'm not 100% sure).

@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
if len(arrays[i]) != len(arrays[i - 1]):
raise ValueError('all arrays must be same length')

cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we use Categorical.from_array any longer? (in the codebase)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are just a few places: internals of concat and unstack, LegacyTable in pytables, and panel_index. I can replace them all with _factorize - it's pretty straightforward. It wouldn't automatically fix concat and unstack (that's why I opened separate issues for them) but wouldn't hurt either.

If we replace them, what should be done to the definition of Categorical.from_array? Remove completely or rather add a deprecation warning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we should replace these and deprecate the constructor. but can do that later / another PR if desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll try to do it today if it goes smoothly. Otherwise, I'll do a follow up as soon as this PR is merged in.

Just a question:
Categorical.from_array should emit a FutureWarning with a comment like this: "Categorical.from_array is deprecated, use Categorical instead"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@jreback
Copy link
Contributor

jreback commented Aug 26, 2016

@pijucha test split looks good.

@sinhrks any comments?

@pijucha
Copy link
Contributor Author

pijucha commented Aug 27, 2016

Deprecated Categorical.from_array.

@pijucha
Copy link
Contributor Author

pijucha commented Aug 31, 2016

update

@jreback
Copy link
Contributor

jreback commented Aug 31, 2016

lgtm. @sinhrks @jorisvandenbossche

@jreback jreback mentioned this pull request Aug 31, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.19.0rc Sep 1, 2016
@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

@pijucha can you rebase

…eprecate .from_array

Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`,
`get_dummies`, and `make_axis_dummies`.
@pijucha
Copy link
Contributor Author

pijucha commented Sep 2, 2016

@jreback Done (rebase + small update to tests/test_reshape.py).

One build on travis has probably stalled - should I resubmit?

@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

@pijucha i restarted the build. ping when green.

@jreback jreback closed this in d26363b Sep 2, 2016
@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

thanks @pijucha really nice PR. touches lots of parts!

@pijucha pijucha deleted the catdtype branch September 4, 2016 14:37
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0rc, 0.19.0 Sep 7, 2016
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 5, 2017
jreback pushed a commit that referenced this pull request Dec 5, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby on multiple columns does not preserve (categorical) dtype
5 participants