
GridSearchCV extremely slow with DataFrameMapper? #11

Closed

andytwigg opened this issue May 9, 2014 · 13 comments

Comments

@andytwigg

I have a dataframe, not particularly large (~3000 rows, 250 cols) on which I do the following:

df = ...
obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O']
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O']
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid)
grid_search.fit(df[data], df[target]) # this is REALLY slow

From a quick glance, it seems to spend all its time indexing dataframe objects. By contrast, the following two pieces of code are both fast:

# (1) manual grid search with explicit train/test splits
for params in ParameterGrid(param_grid):
  pipeline.set_params(**params)
  X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(df[data], df[target])
  pipeline.fit(X_train, y_train)
  score = pipeline.score(X_test, y_test)

# (2) mapping the dataframe up front, then grid-searching over arrays
y = df[target]
X = mapper.fit_transform(df[data], y)
pipeline = Pipeline([('clf', SGDClassifier())])
grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid)
grid_search.fit(X, y)

So it must be something to do with using GridSearchCV with the DataFrameMapper. Any ideas?

More generally, is there a better way to handle categorical variables?

@ogrisel
Contributor

ogrisel commented May 9, 2014

Could you please try to provide a code snippet that generates random data and exhibits the same behavior?

It would also be interesting to report the output of a profiler, for instance using the %prun magic command in an IPython session.
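For instance, something along these lines (a minimal sketch; -s cumulative is just one convenient way to sort the output):

# run the slow fit under the IPython profiler, sorted by cumulative time
%prun -s cumulative grid_search.fit(df[data], df[target])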

@andytwigg
Author

import numpy as np
import pandas as pd
import sklearn_pandas
import sklearn.pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.linear_model import SGDClassifier


n = 1000
k = 100
cols = dict([(str(c), np.random.randint(1000, size=n)) for c in range(k)])
df = pd.DataFrame(cols)
df['target'] = np.random.randint(2, size=n)
data = [str(c) for c in range(k)]  # column names are strings
target = 'target'

obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O' and c != target]
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O' and c != target]
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid, verbose=2)
grid_search.fit(df[data], df[target]) # this is REALLY slow

@andytwigg
Author

From %prun:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      830    1.623    0.002  129.768    0.156 __init__.py:71(_get_col_subset)
   225830    1.588    0.000  104.399    0.000 series.py:489(__getitem__)
...
       28    0.009    0.000    0.011    0.000 {sklearn.linear_model.sgd_fast.plain_sgd}

Is this helpful? It seems that almost all the time is spent in _get_col_subset.

@tdhopper

I'm seeing very similar behavior with sklearn_pandas.cross_val_score, I believe.

@dukebody
Collaborator

dukebody commented Nov 8, 2015

I've been investigating this and the culprits seem to be these lines:

time unit: 1e-6 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
90                                               @profile
91                                               def _get_col_subset(self, X, cols):
...
105                                           
106        45           27      0.6      0.0          if isinstance(X, list):
107       295       126293    428.1     70.4              X = [x[cols] for x in X]
108        45        48792   1084.3     27.2              X = pd.DataFrame(X)

Apparently the DataWrapper prevents the sklearn cross-validation functions from turning the dataframe into a numpy array before it reaches _get_col_subset. Since the DataWrapper instance doesn't have a shape attribute, sklearn.cross_validation._safe_split returns a list of Series, one per row (example), to take part in CV. These Series are later grouped back into a dataframe inside the _get_col_subset method.

I'm not sure what the best way to deal with this is. Replacing the previous two lines with:

X = pd.DataFrame(X)

and leaving the cols slicing to the later code in the same function seems to provide a good speedup (around 3x), but I still have to write tests to ensure it doesn't break anything.
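To illustrate the diagnosis, here is a standalone sketch (not the sklearn-pandas internals themselves) of why rebuilding the frame once is so much cheaper than slicing every row Series:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))
# roughly what _safe_split hands back when the input has no shape attribute:
rows = [df.iloc[i] for i in range(len(df))]

# slow path: one Series __getitem__ per row, for every column lookup
col_slow = pd.Series([r[0] for r in rows])

# fast path: rebuild the frame once, then slice columns as usual
col_fast = pd.DataFrame(rows)[0]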

Perhaps we can get better speedups without the list trick, but I don't know how to do that while still preventing sklearn from turning the dataframe into a numpy array.

Ideas welcome! :)

@dukebody
Collaborator

dukebody commented Nov 8, 2015

Hm, I was testing with scikit-learn==0.15.2. It looks like this might already be solved in scikit-learn>=0.16.0, since it uses the indexable function to check the input instead of check_arrays.

See #26 (comment) and https://github.com/scikit-learn/scikit-learn/blob/0.16.0/sklearn/cross_validation.py#L1350.
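A quick sketch of the difference (just an illustration of indexable passing frames through unchanged; the assert is mine, not from the linked code):

import pandas as pd
from sklearn.utils import indexable

df = pd.DataFrame({'a': [1, 2, 3]})
X, = indexable(df)
assert isinstance(X, pd.DataFrame)  # no conversion to a numpy array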

@dukebody
Collaborator

dukebody commented Dec 8, 2015

Perhaps we should just note in the documentation that the custom CV wrappers are only needed for scikit-learn<0.16.0, and leave it at that. What do you think @zacstewart?

@zacstewart
Contributor

I think documenting that is a good idea, but maybe we should also make sklearn_pandas.GridSearchCV pass through to sklearn itself, depending on the version. Is something like this worth uglying up the code to make it future-friendly?

import sklearn
import sklearn.grid_search
import sklearn_pandas
from distutils.version import StrictVersion

if StrictVersion(sklearn.__version__) >= StrictVersion('0.16'):
  sklearn_pandas.GridSearchCV = sklearn.grid_search.GridSearchCV

@dukebody
Collaborator

I don't think it's worth uglying up the code that way. We can say that these wrappers are deprecated and will eventually be dropped in sklearn-pandas 2.0. I will, however, add a test making sure that sklearn.grid_search.GridSearchCV in scikit-learn>=0.16.0 works with a DataFrameMapper in a pipeline.
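A rough sketch of what such a test might look like (the column names and grid below are illustrative only, not the actual test):

import numpy as np
import pandas as pd
import sklearn.grid_search
import sklearn.pipeline
import sklearn_pandas
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def test_gridsearchcv_with_dataframemapper():
  df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
  y = np.random.randint(2, size=100)
  pipeline = sklearn.pipeline.Pipeline([
    ('mapper', sklearn_pandas.DataFrameMapper([
      ('a', StandardScaler()),
      ('b', StandardScaler()),
    ])),
    ('clf', SGDClassifier()),
  ])
  grid = sklearn.grid_search.GridSearchCV(pipeline, {'clf__penalty': ['l1', 'l2']})
  grid.fit(df, y)  # should simply not raise on scikit-learn>=0.16.0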

@dukebody
Collaborator

@zacstewart can you review #48 please? It's a really minor addition but I always like the four-eyes approach to changes. :-)

@Balandat

Along those lines: unfortunately, CalibratedClassifierCV, introduced in sklearn 0.16, does not seem to work with a DataFrameMapper in a pipeline (this is still the case in sklearn 0.17).
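For context, a sketch of the kind of usage being described, assuming a mapper pipeline like the ones above (the failure itself is as reported, not reproduced here):

from sklearn.calibration import CalibratedClassifierCV

# wrap the mapper + classifier pipeline; on the affected versions,
# fit() reportedly fails when the input is a dataframe
calibrated = CalibratedClassifierCV(pipeline, cv=3)
calibrated.fit(df[data], df[target])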

@dukebody
Collaborator

@Balandat Could you provide an example with a traceback (or wrong result)? Thanks.

dukebody added a commit that referenced this issue Jan 16, 2016
Deprecate custom CV shims in documentation and code. Refs #11.
@dukebody
Collaborator

@Balandat I'm closing this issue since it's already fixed. I've opened #53 to follow up on the issue you mention.

@dukebody dukebody closed this as completed Mar 6, 2016