ENH/BUG: Let patsy use categoricals when they exist... BIG speed improvement #97

thequackdaddy · 2016-10-24T00:07:55Z

So when a column is declared a categorical, patsy currently skips all the pandas builtins that greatly speed up the process.

I think there was a piece of the code out-of-order here:

https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L312

I think we want to set data = data.data earlier to take advantage of pandas.

So before making this change:

from patsy import dmatrix
import numpy as np
import pandas as pd

# size must be an integer; the float 1e7 triggers a VisibleDeprecationWarning
x = np.random.choice(list('abcdefg'), size=10000000)
# 'category' is the dtype string pandas documents for categorical data
x = pd.Series(x, dtype='category')

%time dmatrix('C(x)')


Wall time: 18.7 s

And after:

from patsy import dmatrix
import numpy as np
import pandas as pd

# size must be an integer; the float 1e7 triggers a VisibleDeprecationWarning
x = np.random.choice(list('abcdefg'), size=10000000)
# 'category' is the dtype string pandas documents for categorical data
x = pd.Series(x, dtype='category')

%time dmatrix('C(x)')


Wall time: 1.14 s

codecov-io · 2016-10-24T00:16:39Z

Codecov Report

Merging #97 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
+ Coverage   99.08%   99.08%   +<.01%     
==========================================
  Files          30       30              
  Lines        5557     5577      +20     
  Branches      776      782       +6     
==========================================
+ Hits         5506     5526      +20     
  Misses         28       28              
  Partials       23       23

Impacted Files	Coverage Δ
patsy/util.py	`98.47% <100%> (+0.05%)`	✅
patsy/categorical.py	`99.59% <100%> (ø)`	✅

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 924b387...8970522. Read the comment docs.

coveralls · 2016-10-24T00:16:44Z

Coverage decreased (-0.02%) to 98.594% when pulling 4db1296 on thequackdaddy:categorical3 into 20749a1 on pydata:master.

njsmith · 2016-10-24T04:36:47Z

Sorry, I'm having trouble figuring out what this change is doing. There's something about moving a check earlier, and it's mixed in with a behavioral change to level ordering (but maybe just for pandas categoricals), and somehow things are faster...? Or something like that? Can you explain it to me like I'm 5? :-)

thequackdaddy · 2016-10-24T15:29:09Z

@njsmith Sorry for lacking detail. I'm quite confused by it myself! Trust me, the developers of this tool are much more competent programmers than I am. I sort of only accidentally stumbled onto this behavior.

Try this code...

In [8]: import numpy as np
   ...: import pandas as pd
   ...: from patsy import dmatrix
   ...: 
   ...: x = np.random.choice(list('abcdefg'), 10000000)
   ...: x = pd.Categorical(x)
   ...: 

In [9]: %time dmatrix('x')
Wall time: 2.88 s

In [10]: %time dmatrix('C(x)')
Wall time: 46.5 s

So something strange is happening when you use the C(x) command on a categorical.

I think the problem is here...

https://github.com/pydata/patsy/blob/master/patsy/categorical.py#L308

Whenever a factor is wrapped in the C function, it enters this function as a _CategoricalBox. The box stores the underlying data as data.data, so it has to be un-boxed with data = data.data before checking whether the data is safe to use as a pandas categorical. That check returns False when it is handed a _CategoricalBox.

Because the pandas categorical check fails, the data has to go through the whole NAAction check, which is super-comprehensive (yay!) but not as fast as I would like.

So what this PR does is first un-box the _CategoricalBox, then check whether the data is a categorical.
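To make the ordering issue concrete, here is a minimal sketch of the idea. It is not patsy's code: `FakeCategorical`, `Box`, `is_fast_categorical`, and the two `choose_path_*` functions are hypothetical stand-ins invented for illustration, and the "fast" and "slow" labels simply name which branch would be taken.

```python
# Hypothetical stand-ins; none of these are patsy's real classes or functions.
class FakeCategorical:
    """Plays the role of a pandas categorical with precomputed integer codes."""
    def __init__(self, codes, categories):
        self.codes = codes
        self.categories = categories


class Box:
    """Plays the role of the wrapper that C(...) puts around its argument."""
    def __init__(self, data):
        self.data = data


def is_fast_categorical(obj):
    # The type check only recognises the bare categorical, never the wrapper.
    return isinstance(obj, FakeCategorical)


def choose_path_check_first(data):
    # Ordering described as the problem: run the check, then un-box.
    fast = is_fast_categorical(data)
    if isinstance(data, Box):
        data = data.data
    return "fast" if fast else "slow"


def choose_path_unbox_first(data):
    # Ordering described as the fix: un-box, then run the check.
    if isinstance(data, Box):
        data = data.data
    return "fast" if is_fast_categorical(data) else "slow"


boxed = Box(FakeCategorical(codes=[0, 1, 0], categories=["a", "b"]))
print(choose_path_check_first(boxed))  # slow
print(choose_path_unbox_first(boxed))  # fast
```

With the check run first, the wrapper hides the categorical and the slow branch is chosen even though the wrapped data would have qualified for the fast one.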

The re-ordering piece is there only because some tests were failing: patsy checks whether the levels specified in the C function match the order of the categories. A pandas Series with a string object dtype can be re-ordered, and the tests (AFAICT) assume that categoricals can be re-ordered too.

However, there is another test that is supposed to raise an error when a Categorical variable has levels in a different order than the categories... To make those tests pass, I use the reorder_categories method. That didn't exist prior to pandas 0.15.0 (I think), so to get pandas 0.14.0 to pass, I reorder by hand (which would be very slow).
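As a rough illustration of what "reordering by hand" means, the sketch below remaps integer codes from one category order to another in plain Python. It is not the code from this PR: `reorder_by_hand` is a hypothetical helper, and it assumes the pandas convention that a code of -1 marks a missing value. On pandas versions that have it, `Categorical.reorder_categories` does the equivalent remapping without a Python-level loop.

```python
def reorder_by_hand(codes, old_categories, new_categories):
    """Return codes re-expressed against new_categories (hypothetical helper)."""
    # Reordering must keep exactly the same set of categories.
    if (len(old_categories) != len(new_categories)
            or set(old_categories) != set(new_categories)):
        raise ValueError("new_categories must be a permutation of the old ones")
    # Position of each category in the new ordering.
    position = {cat: i for i, cat in enumerate(new_categories)}
    # remap[old_code] gives the code of the same category in the new ordering.
    remap = [position[cat] for cat in old_categories]
    # Assumed convention: -1 marks a missing value and is passed through.
    return [remap[c] if c >= 0 else -1 for c in codes]


old = ["a", "b", "c"]
new = ["c", "b", "a"]
codes = [0, 1, 2, 0, -1]  # a, b, c, a, missing
print(reorder_by_hand(codes, old, new))  # [2, 1, 0, 2, -1]
```

The per-element loop is what makes this approach slow on ten million rows compared with a vectorised remap.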

You could reject this PR and just tell users not to use the C function on data that is already a categorical. However, this allows fully-functioning Helmert and Poly coding at much, much faster speeds.

And personally, I've always used C(x) coding in R, even on factors that were already categories... Just a nice reminder of what I'm doing to that factor. So this lets you continue to use that practice and still be really fast.

Phew. That was a mouthful.

thequackdaddy mentioned this pull request Oct 24, 2016

patsy questions/wishlist #93

Open

thequackdaddy force-pushed the categorical3 branch from 4db1296 to eab2043 Compare November 4, 2016 01:19

thequackdaddy added 6 commits February 28, 2017 18:58

ENH: This should make pandas.Categoricals much faster...

84bcf0b

TST: copy isn't availabe on categoricals on non-new pandas. Removed test I think contradicts another test.

f614fc3

reorder 0.14 by hand...

2924727

the attribute shoudl be categories

62618b6

0.18 gives a warning about accessing levels. Just go around that.

aa4cb3d

Can't run this if you don't have pandas

8970522

thequackdaddy force-pushed the categorical3 branch from 18f2ab6 to 8970522 Compare March 1, 2017 00:58

thequackdaddy closed this Feb 1, 2018

thequackdaddy deleted the categorical3 branch October 30, 2018 13:00
