BUG: DataFrame.to_records, DataFrame constructor broken for categoricals #8626

Closed
fkaufer opened this issue Oct 24, 2014 · 24 comments · Fixed by #8652
Labels
Bug Categorical Categorical Data Type

@fkaufer

fkaufer commented Oct 24, 2014

to_records() and the df constructor are broken for data containing categoricals when created with dtype='category' or astype('category').

Broken (TypeError: data type not understood):

pd.DataFrame(list('abc'),dtype='category')

df=pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')
df.to_records()

Works:

pd.Series(list('abc'), dtype='category')

pd.DataFrame(list('abc'), dtype=pd.core.common.CategoricalDtype)

df=pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype(pd.core.common.CategoricalDtype)
df.to_records()

pd.Series(list('abc'), dtype='category').to_frame().to_records()

to_frame seems to remove the category dtype though.

Pandas version: 0.15.0-20-g2737f5a

@jreback
Contributor

jreback commented Oct 24, 2014

hmm, the second part to_frame() converting back to object is a bug (it should remain a categorical)

I am not sure I buy the first part. If you want to see if you can 'fix' and demonstrate that would help.

@jreback jreback added Bug Categorical Categorical Data Type labels Oct 24, 2014
@jreback jreback added this to the 0.15.1 milestone Oct 24, 2014
@fkaufer
Author

fkaufer commented Oct 24, 2014

Actually, there are several issues. I only had the impression that they are somehow related in the sense that it makes a difference whether I use dtype=pd.core.common.CategoricalDtype or dtype='category'. But in fact for to_frame() this doesn't matter: it removes the categorical either way.

My main problem is this

df=pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')
df.to_records()

Result:

[...]
//anaconda/envs/pd15/lib/python2.7/site-packages/pandas-0.15.0_20_g2737f5a-py2.7-macosx-10.5-x86_64.egg/pandas/core/common.pyc in __bytes__(self)
    163         Yields a bytestring in both py2/py3.
    164         """
--> 165         from pandas.core.config import get_option
    166 
    167         encoding = get_option("display.encoding")

TypeError: data type not understood

I have no problems when converting as follows:

df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()

The same is true for pd.DataFrame(list('abc'), dtype='category'), which I discovered only coincidentally when I was about to prepare a minimal example to report this supposed to_records issue; I usually don't use dtype='category' in the df constructor.

Another example which shows it is not only related to to_records:

df['A'] = df['A'].astype(pd.core.common.CategoricalDtype)
pd.lib.fast_zip([df.A.values, df.B.values])

... works as expected.

df['A'] = df['A'].astype('category')
pd.lib.fast_zip([df.A.values, df.B.values])

... returns:

...
//anaconda/envs/pd15/lib/python2.7/site-packages/pandas-0.15.0_20_g2737f5a-py2.7-macosx-10.5-x86_64.egg/pandas/lib.so in pandas.lib.fast_zip (pandas/lib.c:9860)()

SystemError: numpy/core/src/multiarray/iterators.c:370: bad argument to internal function

@jreback
Contributor

jreback commented Oct 24, 2014

you realize that using to_records() defeats the entire purpose of categorical as numpy cannot support this

@fkaufer
Author

fkaufer commented Oct 25, 2014

Sure, but I don't see the point. Most to_... methods decode categoricals, but obviously there are reasons to use them.

Ironically I came across this issue when creating factor/key variables from multiple columns:

pd.factorize(df.to_records())

But I ended up using pd.lib.fast_zip anyway, which is less convenient but faster by an order of magnitude:

pd.factorize(pd.lib.fast_zip([df[c].values for c in df.columns]))

@jreback
Contributor

jreback commented Oct 25, 2014

by your reply I am confused

to_records gives you a structured array
the only thing you could do with a categorical is make it into a regular object array
the back end has to support the categorical - in fact NO back ends currently support it and only HDF5 will actually be able to reproduce it faithfully
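To illustrate the coercion described above, here is a minimal sketch. It assumes a recent pandas release (not the 0.15.0 build discussed in this thread), where to_records() decodes a categorical column into its plain values. The frame and the column name mirror the original report.

```python
import pandas as pd

# Same frame as in the report: one column converted with the 'category' alias.
df = pd.DataFrame({"A": list("abc")})
df["A"] = df["A"].astype("category")

# to_records() returns a NumPy record array. NumPy has no categorical dtype,
# so the column is decoded back to its plain values.
rec = df.to_records()

print(rec.dtype.names)    # field names: the index plus column 'A'
print(rec["A"].tolist())  # the decoded values
```

The categorical information (categories, ordering) is not carried into the record array; only the values survive.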

@fkaufer
Author

fkaufer commented Oct 25, 2014

Sorry for the confusion.

Just to make sure: the main issue here is the inconsistency between using pandas.core.common.CategoricalDtype and its alias 'category' for various functions (e.g. to_records) and the fact that using 'category' sometimes triggers exceptions. IMO these exceptions are bugs. But reading your comments above and on my request (#8633) for graceful degradation behavior for categorical conversion in general makes me believe you consider the exceptions intentional.

But if this is the case and you think to_records shouldn't work for categoricals and raising TypeError: data type not understood is correct, then to_records should also fail for pandas.core.common.CategoricalDtype, right?

But instead

df=pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')
df.to_records()
Out[...]
...
TypeError: data type not understood

Whereas

df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
Out[...]
rec.array([(0, 'a'), (1, 'b'), (2, 'c')], 
      dtype=[('index', '<i8'), ('A', 'O')])

@fkaufer
Author

fkaufer commented Oct 25, 2014

Oh wait, it seems the reason for the inconsistency is even weirder. The second call of .astype(pd.core.categorical.Categorical) converts back to object.

df=pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
...
TypeError: data type not understood
df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
rec.array([(0, 'a'), (1, 'b'), (2, 'c')], 
      dtype=[('index', '<i8'), ('A', 'O')])
df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
...
TypeError: data type not understood

.astype('category') is - as expected - idempotent on categoricals.

@jreback
Contributor

jreback commented Oct 25, 2014

@fkaufer

in the other thread, I was referring specifically to serialization/deserialization to other formats (e.g. to_*), that are actual formats, e.g. hdf,csv,stata,msgpack (NotImplemented)

while to_records is a bug, and is a conversion issue (this is a subtle point here). I agree this should simply coerce (and that's what this issue is about).

Further, you should never need to touch the actual CategoricalDtype object directly, and simply use 'category' (they should be interchangeable and if not that's a bug).

The reason we released categorical is it's simply impossible to catch ALL possible cases. This is a massive addition and I think we got most, but that's why it's helpful for you to find bugs! (which turn into tests and will get fixed for 0.15.1)
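The interchangeability claim can be checked with a short sketch. It assumes a recent pandas release in which the dtype class is exposed as pd.CategoricalDtype (the thread's pd.core.common.CategoricalDtype path belongs to the 0.15 code base); the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(list("abca"))

by_alias = s.astype("category")
by_dtype = s.astype(pd.CategoricalDtype())

# Both spellings should yield the same dtype and the same values.
same_dtype = by_alias.dtype == by_dtype.dtype
same_values = by_alias.equals(by_dtype)

# Converting an already-categorical Series again should change nothing.
idempotent = by_alias.astype("category").equals(by_alias)

print(same_dtype, same_values, idempotent)
```

If any of the three flags came out False, that would be the kind of inconsistency this issue is about.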

@jreback
Contributor

jreback commented Oct 25, 2014

I realize u r using pd.core.categorical.Categorical as a dtype
that is undefined! (the dtype is CategoricalDtype)

Categorical is the actual object

technically it's actually ok and doesn't actually raise (nor does Numpy when u do this) - but prob should simply raise as it's not correct at all

@jreback
Contributor

jreback commented Oct 27, 2014

@fkaufer so #8652 should fix all 3 issues that you brought up (in future, FYI, it's better to put check boxes at the top so we can keep track).

pls test and lmk.

@jreback jreback modified the milestones: 0.15.2, 0.15.1 Oct 30, 2014
@fkaufer
Author

fkaufer commented Nov 8, 2014

Sorry for the late reply - to_records works, pd.lib.fast_zip still (0.15.1-1-g66a0a74) doesn't.

df=pd.DataFrame(np.random.choice(list(u'abcde'), 20).reshape(10, 2),
    columns=list(u'AB'))
pd.lib.fast_zip([df.A.values, df.B.values])
[...]
array([(u'c', u'b'), (u'e', u'a'), (u'c', u'b'), (u'd', u'e'),
       (u'e', u'a'), (u'a', u'a'), (u'a', u'e'), (u'e', u'e'),
       (u'e', u'c'), (u'e', u'a')], dtype=object)
df=pd.DataFrame(np.random.choice(list(u'abcde'), 20).reshape(10, 2),
    columns=list(u'AB'))
for col in df.columns: df[col] = df[col].astype('category')
pd.lib.fast_zip([df.A.values, df.B.values])
[...]
SystemError: numpy/core/src/multiarray/iterators.c:370: bad argument to internal function

But perhaps that's a different issue.

@jreback
Contributor

jreback commented Nov 8, 2014

that's an internal routine
I wouldn't expect it to work
you have to present an ndarray to it
why are you using it?

@jreback
Contributor

jreback commented Nov 8, 2014

if u really want to use it
you can do
df.A.get_values() which is guaranteed to give u an ndarray

@fkaufer
Author

fkaufer commented Nov 9, 2014

Oh, wasn't aware of the subtle difference between values and get_values(). Think, it's also not obvious from the doc. Gotcha.
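The distinction can be made concrete with a small sketch. It assumes a recent pandas release: to my knowledge get_values() was later deprecated and removed, so to_numpy() stands in for it here. That substitution is my assumption, not something stated in this thread.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abca"), dtype="category")

# .values keeps the pandas Categorical container ...
container = s.values
# ... whereas to_numpy() materializes a plain NumPy array of the values.
plain = s.to_numpy()

print(type(container).__name__)
print(type(plain).__name__, plain.tolist())
```

Routines that expect a real ndarray (like the internal fast_zip mentioned above) need the second form.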

I use pd.lib.fast_zip as input for pd.factorize (see comment above .. 15 days ago), since there is no DataFrame.factorize(), but only Series.factorize() (btw: why?):

df['gid'], grps =pd.factorize(pd.lib.fast_zip([df.A.get_values(), df.B.get_values()]))

@jreback
Contributor

jreback commented Nov 9, 2014

@fkaufer I like that you are finding bugs in Categoricals! But you should never have to use .values/.get_values() or any routines internally. get_values() for all intents is a completely internal routine, and the user should never need to use it.

you can do:

DataFrame.apply(lambda x: x.factorize())

or if you really want to factorize all values

pd.factorize(DataFrame.values.ravel())

note that using .values on a DataFrame coerces to a compatible dtype, so by definition be careful.

another approach is to simply construct the series and then factorize,

df.A.append(df.B).factorize()

but really why are you factorizing directly?

@fkaufer
Author

fkaufer commented Nov 10, 2014

Seems we're a bit in a loop. So let me explain in a bit more detail.

Upfront the general question: Why factorizing directly?
I need multi-column factorization in the sense of creating an integer id (while keeping the mapping key<->value-combinations separately) for every unique combination of values across multiple columns of a DataFrame (not necessarily all columns!) for various use cases: Custom data cleansing/deduplication algorithms on column subsets (no primary key -> generate primary key candidates), DWH-like splitting of large aggregated data into fact tables/frames and dimension tables, preprocessing for Machine Learning algorithms (clustering, classification). For some of these use cases it's possible to use groupby-apply/transform, but not always.

Now: Why using the sketched technical approach?
The current approach is basically a cascade of workarounds and I'm aware that I'm partially using internal functions. Step by step:

  • Ideally, there would be a DataFrame.factorize() method, but there is only Series.factorize(). So I was searching for a workaround and discovered pd.factorize()
  • Unfortunately pd.factorize() does not take a DataFrame or ndarray (with n>1) either, so I have to "emulate" and pack n columns into a single column.
  • My first approach for this packing was to use DataFrame.to_records(). This was the original trigger for this issue, because to_records failed for categorical columns (side note: although factorize and categoricals are related, they are not in this context, it just happened to be that I had categoricals for some of the columns over which I factorized).
  • Since DataFrame.to_records() was broken for categoricals and is quite slow anyway, I searched for an alternative and found pd.lib.fast_zip which is really fast but, being a zip function, requires splitting the DataFrame beforehand.
  • Next issue: pd.lib.fast_zip doesn't take a list of Series, which is the reason I used .values. And .values is part of the API reference, which makes me wonder why you state "you should never have to use .values [...]". I'd say you have to use .values whenever you leave pandas land (or enter pandas twilight zone -> pd.lib.fast_zip, pd.factorize), need pure Numpy data structures and DataFrame/Series are not coerced to suitable dtypes automatically, right?
  • So, pd.lib.fast_zip([df.A.values, df.B.values]) would work if A/B do not contain categoricals, i.e. df['A'] = df['A'].astype('category'); pd.lib.fast_zip([df.A.values, df.B.values]) raises SystemError: numpy/core/src/multiarray/iterators.c:370: bad argument to internal function.
  • This is what I reported two days ago. You replied with a recommendation to use "df.A.get_values() which is guaranteed to give u an ndarray". I haven't used get_values() before but indeed this also works for categoricals. In your last comment you now wrote "get_values() for all intents is a completely internal routine, and the user should never need use it.". That's what I meant by "we're in a loop" at the beginning.

To sum up, the following workaround now works:

pd.factorize(pd.lib.fast_zip([df[col].get_values() for col in factorize_cols]))

I don't see how your proposals are equivalent alternatives to what I'm doing:

  • DataFrame.apply(lambda x: x.factorize()) applies factorize for each Series separately.
  • pd.factorize(DataFrame.values.ravel()) treats the DataFrame as one single Series.
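The difference between these variants and row-wise factorization can be seen on a five-row frame. This is a sketch assuming a recent pandas release; groupby(...).ngroup() is used as a stand-in for the row-wise ids and is my choice for illustration, not something proposed in this thread.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2", "a1"],
                   "B": ["b1", "b2", "b1", "b2", "b1"]})

# 1) Column by column: each column gets its own, independent code space.
per_column = {col: pd.factorize(df[col])[0].tolist() for col in df.columns}

# 2) Flattened: all cells are pooled into one sequence of values.
flat_codes, flat_uniques = pd.factorize(df.to_numpy(dtype=object).ravel())

# 3) Row-wise: one id per distinct (A, B) combination.
row_ids = df.groupby(["A", "B"], sort=False).ngroup().tolist()

print(per_column)           # {'A': [0, 0, 1, 1, 0], 'B': [0, 1, 0, 1, 0]}
print(flat_codes.tolist())  # [0, 1, 0, 2, 3, 1, 3, 2, 0, 1]
print(row_ids)              # [0, 1, 2, 3, 0]
```

Only the third result has one entry per row that identifies the value combination, which is what the workaround above is after.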

But if you provide a DataFrame.factorize() method such that I can simply do ...

df[factorize_cols].factorize()

... then I promise that I'll keep my hands off internal functions .. for now.

As a general remark regarding "categorical bug reporting". I'm not sitting here being overly eager to find as many categoricals bugs as possible and therefore also testing all internal functions. It's just that having categoricals is really beneficial for my work such that for some tasks I'm currently working directly on the development/master pandas branch. By that I just stumble across the issues and report them without distinguishing between issues with supposedly internal functions and official API functions but just report any inconsistencies I come across. And I think it is really important that categoricals behave consistently with other dtypes such that you can get just the extra benefits of categoricals without breaking existing code.

@jreback
Contributor

jreback commented Nov 10, 2014

can u show me a small example of what you are wanting

it seems that u simply want a categorical for a few selected columns that have the same categories?

give me a complete concrete example and I'll show you how you should do it

all of the functionality is there now (or if not we'll see what we can do)

DataFrame.factorize() doesn't make sense but maybe your example will shed some light

@fkaufer
Author

fkaufer commented Nov 10, 2014

Probably it would be better to open a new issue: "ENH: DataFrame.factorize()" but here you go ...

firstname  lastname  city    login_date
John       Doe       London  2013-01-02
John       Doe       Berlin  2013-11-02
Peter      Doe       London  2013-11-03
John       Doe       London  2014-02-02
John       Doe       Berlin  2014-04-28

df['user_id_1'], labels_1 = df[['firstname', 'lastname']].factorize()
df['user_id_2'], labels_2 = df[['firstname', 'lastname', 'city']].factorize()
df

user_id_1  user_id_2  firstname  lastname  city    login_date
1          1          John       Doe       London  2013-01-02
1          2          John       Doe       Berlin  2013-11-02
2          3          Peter      Doe       London  2013-11-03
1          1          John       Doe       London  2014-02-02
1          2          John       Doe       Berlin  2014-04-28
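Since DataFrame.factorize() does not exist, the requested behavior can be approximated with existing calls. The sketch below assumes a recent pandas release; factorize_rows is a hypothetical helper name introduced only for illustration. pd.factorize returns 0-based codes, so 1 is added to match the 1-based ids in the table above.

```python
import pandas as pd

def factorize_rows(frame, cols):
    """Hypothetical helper: 0-based id per distinct combination of `cols`."""
    # Pack the selected columns into one array of tuples, then factorize it.
    tuples = pd.MultiIndex.from_frame(frame[cols]).values
    return pd.factorize(tuples)

df = pd.DataFrame({
    "firstname": ["John", "John", "Peter", "John", "John"],
    "lastname": ["Doe"] * 5,
    "city": ["London", "Berlin", "London", "London", "Berlin"],
    "login_date": ["2013-01-02", "2013-11-02", "2013-11-03",
                   "2014-02-02", "2014-04-28"],
})

codes_1, labels_1 = factorize_rows(df, ["firstname", "lastname"])
codes_2, labels_2 = factorize_rows(df, ["firstname", "lastname", "city"])
df["user_id_1"] = codes_1 + 1
df["user_id_2"] = codes_2 + 1

print(df[["user_id_1", "user_id_2", "firstname", "lastname", "city"]])
```

labels_1 and labels_2 hold the distinct value combinations in order of first appearance, so an id can be mapped back to its key.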

@jorisvandenbossche
Member

A small example as far as I understand (@fkaufer correct me if I am wrong!)

In [41]: df = pd.DataFrame({'A':['a1','a1','a2','a2','a1'], 'B':['b1','b2','b1','b2','b1']})

In [42]: df
Out[42]: 
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1

In [43]: cols_as_tuples = pd.lib.fast_zip([df[col].get_values() for col in df.columns])

In [44]: cols_as_tuples 
Out[44]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2'), ('a1', 'b1')], dtype=object)

In [47]: pd.factorize(cols_as_tuples)
Out[47]: 
(array([0, 1, 2, 3, 0]),
 array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))

In [48]: pd.Categorical(cols_as_tuples)
Out[48]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]

In [59]: pd.Categorical(df.to_records(index=False))
Out[59]: 
[(a1, b1), (a1, b2), (a2, b1), (a2, b2), (a1, b1)]
Categories (4, object): [(a1, b1) < (a1, b2) < (a2, b1) < (a2, b2)]

So, @jreback, @fkaufer wants to make categories based on values as the full rows (combined values of all (or a selection of the) columns), a bit like df.drop_duplicates also regards the full rows as entities to check uniqueness.

@fkaufer I think this is a more uncommon operation, and I don't know if pandas should provide a built-in way to do this (I am not fully convinced that df.factorize() should do this. UPDATE: hmm, maybe this does make sense ..).
But given that, I think your solution (as I used it in the example above) seems quite good. The question is maybe if there should be some kind of to_array_of_tuples method (some wrapper around fast_zip, as is also used in DataFrame.duplicated)

@fkaufer
Author

fkaufer commented Nov 11, 2014

DataFrame.factorize() makes as much sense as DataFrame.groupby() and DataFrame.drop_duplicates()/DataFrame.duplicated() make sense. The difference is just, that you want to deal with the groups (group ids) without (direct) aggregation/dropping and preserve the data frame structure.

@jorisvandenbossche I do not agree that this is uncommon. Generating ids for a column subset is an important preprocessing step for a lot of algorithms doing duplicate detection, clustering, classification, association rule mining, functional/inclusion dependency detection, etc. It is even more useful when dealing with denormalized dirty data which I would say is pandas' bread-and-butter business.

Stata has egen ... group for that purpose:

egen user_id = group(firstname lastname city)

http://www.stata.com/support/faqs/data-management/creating-group-identifiers/

In R I would do something like

transform(df, user_id = as.numeric(interaction(firstname, lastname, city, drop=TRUE)))

@jreback
Contributor

jreback commented Nov 11, 2014

@fkaufer

See #8709 for an idea of how to represent this. A 2-d categorical will represent this well (with tuples for categories). Will think about this for CategoricalIndex as well (for which this would be natural).

But this works now

In [36]: df
Out[36]: 
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
4  a1  b1

In [37]: columns = ['A','B']


In [50]: index = MultiIndex.from_arrays([df[col] for col in columns ])

In [51]: index
Out[51]: 
MultiIndex(levels=[[u'a1', u'a2'], [u'b1', u'b2']],
           labels=[[0, 0, 1, 1, 0], [0, 1, 0, 1, 0]],
           names=[u'A', u'B'])

In [52]: index.unique()
Out[52]: array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object)

In [53]: pd.factorize(index.values)
Out[53]: 
(array([0, 1, 2, 3, 0]),
 array([('a1', 'b1'), ('a1', 'b2'), ('a2', 'b1'), ('a2', 'b2')], dtype=object))

I suppose we could make a cookbook entry for this. @fkaufer how are you then using the factorized values?

And here's the big difference between groupby/drop_duplicates...etc.

These don't preserve the structure! Instead they are an aggregation across a dimension (here some columns). I suppose you could make a DataFrame.factorize() to do this (or just have pd.factorize() handle this). I am still not understanding what you are going to DO with this.

Believe me, all for a better function/way to do X. But what is X here?

@fkaufer
Author

fkaufer commented Nov 11, 2014

I'm wondering a bit why the use cases are not obvious, but I'll try to elaborate on the "X" factor asap with an example.

In general the use cases are algorithms working on groups/value-combinations, so clustering in the broadest sense (I threw in some buzzwords in a comment above: "algorithms doing duplicate detection, clustering, classification, association rule mining, functional/inclusion dependency detection, etc."). Most ML algorithms for clustering and classification work on numeric values or at least it's faster to work on numeric/integer values instead of dealing with records potentially containing lengthy strings. So encoding of records is an important step.

"And here's the big difference between groupby/drop_duplicates...etc. These don't preserve the structure!"

Yes exactly, that's what I said. For the use cases mentioned I want to keep the structure, partially because I want to work on many (potentially overlapping) groups in parallel.

Regarding your proposal. The result is equivalent, but I'm not convinced and here is why:

%timeit pd.factorize(pd.MultiIndex.from_arrays([df[columns]]))
1 loops, best of 3: 1.33 s per loop

# has not worked for categoricals, but is fixed now, see #8652
%timeit pd.factorize(df.to_records())
1 loops, best of 3: 1.42 s per loop

# still does not work for categoricals!
%timeit pd.lib.fast_zip([df[c].values for c in columns])
10 loops, best of 3: 99.8 ms per loop

# necessary when columns contain categoricals
%timeit pd.lib.fast_zip([df[c].get_values() for c in columns])
10 loops, best of 3: 99.4 ms per loop

# further alternatives possible with groupby.groups/groupby.indices

Actually I favor pd.factorize(df.to_records()) with respect to syntax but it's too slow. My workarounds with fast_zip are fast, but are inconvenient and look like workarounds, use internals and at least the .values subvariant is brittle.

Recall: this brittleness or, more precisely, this inconsistency when using categoricals in some of the columns of a df is the actual topic of this issue, i.e. having categoricals in my data broke my workarounds for df.factorize(). Not having df.factorize() is of course not a bug but a new feature. I think a PR for DataFrame.factorize() using pd.lib.fast_zip is almost trivial, main effort is providing a test suite ... and initializing "X" of course.
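A comparison like the one above can be rerun against current pandas with a small harness. This is only a sketch under assumptions: the frame size and contents are made up, the two candidates (tuples via MultiIndex plus pd.factorize, and groupby(...).ngroup()) are chosen for illustration and are not the pd.lib.fast_zip calls timed above, and no timings are claimed since they depend on machine and version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # hypothetical size, not the one used in the thread
df = pd.DataFrame({"A": rng.choice(list("abcde"), n),
                   "B": rng.choice(list("vwxyz"), n)})
columns = ["A", "B"]

def via_multiindex():
    # Pack rows into tuples, then factorize the tuple array.
    return pd.factorize(pd.MultiIndex.from_frame(df[columns]).values)[0]

def via_ngroup():
    # Let groupby number the distinct column combinations.
    return df.groupby(columns, sort=False).ngroup().to_numpy()

ids_a = via_multiindex()
ids_b = via_ngroup()

# The two labelings should agree up to renumbering: pairing them up must not
# create more distinct pairs than either labeling has distinct ids.
n_pairs = len(set(zip(ids_a.tolist(), ids_b.tolist())))
print(n_pairs == len(set(ids_a.tolist())) == len(set(ids_b.tolist())))

for fn in (via_multiindex, via_ngroup):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```

Checking that both candidates induce the same grouping before timing them guards against comparing routines that do different work.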

@jreback
Contributor

jreback commented Nov 11, 2014

@fkaufer

just show what you are doing with the results of factorize
an example of why you need it - the above shows why it might be nice to implement but you repeated your explanation - I want to see code that uses the RESULTS of it

@jreback
Contributor

jreback commented Nov 11, 2014

@fkaufer the reason I keep asking questions is that I want to know your flow better
I think .factorize() can easily be added but I am trying to figure out if there is a larger operation at work here
