ENH/BUG: Let patsy use categoricals when they exist... BIG speed improvement #97

Closed
wants to merge 6 commits into from

thequackdaddy
Contributor

When a column is declared as a categorical, patsy currently skips all the pandas built-ins that greatly speed up the process.

I think there was a piece of the code out-of-order here:

https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L312

I think we want to set data = data.data earlier to take advantage of pandas.

So before making this change:

from patsy import dmatrix
import numpy as np
import pandas as pd

# size must be an integer; a float such as 1e7 triggers a deprecation warning (an error in newer NumPy)
x = np.random.choice(list('abcdefg'), size=int(1e7))
# 'category' is the documented dtype string for a categorical Series
x = pd.Series(x, dtype='category')

%time dmatrix('C(x)')


Wall time: 18.7 s

And after:

from patsy import dmatrix
import numpy as np
import pandas as pd

# size must be an integer; a float such as 1e7 triggers a deprecation warning (an error in newer NumPy)
x = np.random.choice(list('abcdefg'), size=int(1e7))
# 'category' is the documented dtype string for a categorical Series
x = pd.Series(x, dtype='category')

%time dmatrix('C(x)')


Wall time: 1.14 s

@codecov-io
codecov-io commented Oct 24, 2016

Codecov Report

Merging #97 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
+ Coverage   99.08%   99.08%   +<.01%     
==========================================
  Files          30       30              
  Lines        5557     5577      +20     
  Branches      776      782       +6     
==========================================
+ Hits         5506     5526      +20     
  Misses         28       28              
  Partials       23       23

Impacted Files         Coverage Δ
patsy/util.py          98.47% <100%> (+0.05%)
patsy/categorical.py   99.59% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 924b387...8970522.

@coveralls
coveralls commented Oct 24, 2016

Coverage Status

Coverage decreased (-0.02%) to 98.594% when pulling 4db1296 on thequackdaddy:categorical3 into 20749a1 on pydata:master.

@njsmith
Member

njsmith commented Oct 24, 2016

Sorry, I'm having trouble figuring out what this change is doing. There's something about moving a check earlier, and it's mixed in with a behavioral change to level ordering (but maybe just for pandas categoricals), and somehow things are faster...? Or something like that? Can you explain it to me like I'm 5? :-)

@thequackdaddy
Contributor Author

@njsmith Sorry for lacking detail. I'm quite confused by it myself! Trust me, the developers of this tool are much more competent programmers than I am. I sort of only accidentally stumbled onto this behavior.

Try this code...

In [8]: import numpy as np
   ...: import pandas as pd
   ...: from patsy import dmatrix
   ...: 
   ...: x = np.random.choice(list('abcdefg'), 10000000)
   ...: x = pd.Categorical(x)
   ...: 

In [9]: %time dmatrix('x')
Wall time: 2.88 s

In [10]: %time dmatrix('C(x)')
Wall time: 46.5 s

So something strange is happening when you use the C(x) command on a categorical.

I think the problem is here...

https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L308

When you wrap any factor in the C function, it enters this function as a _CategoricalBox. The box stores the underlying data as data.data, so you need to un-box it by doing data = data.data before checking whether the data is safe to use as a pandas categorical. The safe pandas check returns False when it is handed a _CategoricalBox.

Since the pandas categorical check fails, the data has to go through the whole NAAction check, which is super-comprehensive (yay!) but not as fast as I would like.

So what this PR does is first un-box the _CategoricalBox, then check whether the data is categorical.
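
The ordering problem can be sketched without patsy or pandas. Everything below is a hypothetical stand-in, not patsy's actual code: `Box` plays the role of `_CategoricalBox`, `FakeCategorical` the role of a pandas categorical, and `is_safe_categorical` the role of the safe pandas check. It only illustrates why checking before un-boxing always falls through to the slow path.

```python
class FakeCategorical:
    """Hypothetical stand-in for a pandas categorical (codes + categories)."""

    def __init__(self, values):
        self.categories = sorted(set(values))
        index = {cat: i for i, cat in enumerate(self.categories)}
        self.codes = [index[v] for v in values]


class Box:
    """Hypothetical stand-in for the wrapper that C(...) produces."""

    def __init__(self, data, levels=None):
        self.data = data      # the wrapped object lives in .data
        self.levels = levels


def is_safe_categorical(obj):
    # A wrapper is not itself a categorical, so this is False for a Box.
    return isinstance(obj, FakeCategorical)


def path_check_first(data):
    # Assumed original order: check first, un-box afterwards.
    if is_safe_categorical(data):
        return "fast"
    if isinstance(data, Box):
        data = data.data
    return "slow"


def path_unbox_first(data):
    # Order proposed in this PR: un-box first, then check.
    if isinstance(data, Box):
        data = data.data
    if is_safe_categorical(data):
        return "fast"
    return "slow"


boxed = Box(FakeCategorical(list("abcabc")))
print(path_check_first(boxed))
print(path_unbox_first(boxed))
```

With the check placed first, the boxed categorical never reaches the fast branch; un-boxing first lets the same check succeed.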

The re-ordering piece was needed because some tests were failing: patsy checks whether the levels specified in the C function match the order of the categories. A pandas Series with a string (object) dtype can be re-ordered, and the tests (AFAICT) assume that categoricals can be re-ordered too.

However, there is another test that is supposed to raise an error when a Categorical variable has levels in a different order than the categories... To make those tests pass, I use the reorder_categories method. That didn't exist prior to pandas 0.15.0 (I think), so for pandas 0.14.0 to pass, I reorder by hand (which would be very slow).
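
Reordering "by hand" amounts to remapping the integer codes from the old category order to the new one. The sketch below uses plain lists rather than pandas objects, and `reorder_codes` is a hypothetical helper, not the function added in this PR. On pandas versions that provide it, `reorder_categories` does the equivalent work.

```python
def reorder_codes(codes, categories, new_categories):
    """Remap integer codes so that they index into new_categories.

    A code of -1 (the convention pandas uses for missing values) stays -1.
    """
    if sorted(categories) != sorted(new_categories):
        raise ValueError("new_categories must contain the same items as categories")
    # Position of each category in the new ordering.
    position = {cat: i for i, cat in enumerate(new_categories)}
    # remap[old_code] gives the new code for the same category.
    remap = [position[cat] for cat in categories]
    return [remap[c] if c >= 0 else -1 for c in codes]


categories = ["a", "b", "c"]
codes = [0, 1, 2, 1, -1]            # a, b, c, b, missing
new_categories = ["c", "a", "b"]

new_codes = reorder_codes(codes, categories, new_categories)
print(new_codes)  # [1, 2, 0, 2, -1]
```

The list comprehension touches every element in Python, which is consistent with the by-hand fallback being slow on large inputs.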

You could reject this PR and just tell users not to use the C function when they already have a categorical. However, this change allows fully functioning Helmert and Poly coding at much, much faster speeds.

And personally, I've always used C(x) coding in R, even on factors that were already categories... Just a nice reminder of what I'm doing to that factor. So this lets you continue to use that practice and still be really fast.

Phew. That was a mouthful.
