-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH/BUG: Let patsy use categoricals when they exist... BIG speed improvement #97
Conversation
Codecov Report
@@ Coverage Diff @@
## master #97 +/- ##
==========================================
+ Coverage 99.08% 99.08% +<.01%
==========================================
Files 30 30
Lines 5557 5577 +20
Branches 776 782 +6
==========================================
+ Hits 5506 5526 +20
Misses 28 28
Partials 23 23
Continue to review full report at Codecov.
|
1 similar comment
Sorry, I'm having trouble figuring out what this change is doing. There's something about moving a check earlier, and it's mixed in with a behavioral change to level ordering (but maybe just for pandas categoricals), and somehow things are faster...? Or something like that? Can you explain it to me like I'm 5? :-) |
@njsmith Sorry for lacking detail. I'm quite confused by it myself! Trust me, the developers of this tool are much more competent programmers than I am. I sort of only accidentally stumbled onto this behavior. Try this code... In [8]: import numpy as np
...: import pandas as pd
...: from patsy import dmatrix
...:
...: x = np.random.choice(list('abcdefg'), 10000000)
...: x = pd.Categorical(x)
...:
In [9]: %time dmatrix('x')
Wall time: 2.88 s
In [10]: %time dmatrix('C(x)')
Wall time: 46.5 s So something strange is happening when you use the I think the problem is here... https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L308 In this function when you put any factor in the Since the pandas categorical fails, it has to go through the whole So what this PR does is first un-box the Categorical box, then check if its categorical. The re-ordering piece was just because some tests were failing because patsy checks if the specified order levels in the However, there is another test that is supposed to raise an error when a Categorical variable has levels in a different order than the categories... To make those tests pass, I use the You could reject this PR and just tell users that if you are using a categorical to not use the And personally, I've always used Phew. That was a mouthful. |
4db1296
to
eab2043
Compare
…est I think contradicts another test.
18f2ab6
to
8970522
Compare
So when a column is declared a categorical, patsy is skipping currently skipping all the
pandas
builtins that greatly speed up the process.I think there was a piece of the code out-of-order here:
https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L312
I think we want to set
data = data.data
earlier to take advantage of pandas.So before making this change:
And after..