
GridSearchCV extremely slow with DataFrameMapper? #11

Closed

andytwigg opened this issue May 9, 2014 · 13 comments

Comments

@andytwigg

I have a dataframe, not particularly large (~3000 rows, 250 cols) on which I do the following:

df = ...
obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O']
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O']
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid)
grid_search.fit(df[data], df[target]) # this is REALLY slow

From a quick glance, it seems to spend all its time indexing dataframe objects. By contrast, the following two pieces of code are both fast:

# (1) manual grid search with explicit train/test splits
for params in ParameterGrid(param_grid):
  pipeline.set_params(**params)
  X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(df[data], df[target])
  pipeline.fit(X_train, y_train)
  score = pipeline.score(X_test, y_test)

# (2) mapping the dataframe up front, then grid-searching over arrays
y = df[target]
X = mapper.fit_transform(df[data], y)
pipeline = Pipeline([('clf', SGDClassifier())])
grid_search = sklearn.grid_search.GridSearchCV(pipeline, param_grid)
grid_search.fit(X, y)

So it must be something to do with using GridSearchCV with the DataFrameMapper. Any ideas?

More generally, is there a better way to handle categorical variables?

@ogrisel
Contributor

ogrisel commented May 9, 2014

Could you please try to provide a code snippet that generates random data and exhibits the same behavior?

It would also be interesting to report the output of a profiler, for instance using the %prun magic command in an IPython session.
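For instance, something along these lines (a minimal sketch; -s cumulative is just one convenient way to sort the output):

# run the slow fit under the IPython profiler, sorted by cumulative time
%prun -s cumulative grid_search.fit(df[data], df[target])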

@andytwigg
Author

import numpy as np
import pandas as pd
import sklearn_pandas
import sklearn.pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.linear_model import SGDClassifier


n = 1000
k = 100
cols = dict([(str(c), np.random.randint(1000, size=n)) for c in range(k)])
df = pd.DataFrame(cols)
df['target'] = np.random.randint(2, size=n)
data = [str(c) for c in range(k)]  # column names are strings
target = 'target'

obj_cols = [(c, LabelBinarizer()) for c in df.columns if df.dtypes[c] == 'O' and c != target]
num_cols = [(c, StandardScaler()) for c in df.columns if df.dtypes[c] != 'O' and c != target]
param_grid = {
  'clf__loss': ['hinge', 'log', 'modified_huber'],
  'clf__penalty': ('l1', 'l2', 'elasticnet'),
}

pipeline = sklearn.pipeline.Pipeline([ 
  ('mapper', sklearn_pandas.DataFrameMapper(obj_cols+num_cols)),
  ('clf', SGDClassifier()),
])

grid_search = sklearn_pandas.GridSearchCV(pipeline, param_grid, verbose=2)
grid_search.fit(df[data], df[target]) # this is REALLY slow

@andytwigg
Author

From %prun:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      830    1.623    0.002  129.768    0.156 __init__.py:71(_get_col_subset)
   225830    1.588    0.000  104.399    0.000 series.py:489(__getitem__)
...
       28    0.009    0.000    0.011    0.000 {sklearn.linear_model.sgd_fast.plain_sgd}

Is this helpful? It seems that almost all the time is spent in _get_col_subset.

@tdhopper

I'm seeing very similar behavior with sklearn_pandas.cross_val_score, I believe.

@dukebody
Collaborator

dukebody commented Nov 8, 2015

I've been investigating this and the culprits seem to be these lines:

time unit: 1e-6 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
90                                               @profile
91                                               def _get_col_subset(self, X, cols):
...
105                                           
106        45           27      0.6      0.0          if isinstance(X, list):
107       295       126293    428.1     70.4              X = [x[cols] for x in X]
108        45        48792   1084.3     27.2              X = pd.DataFrame(X)

Apparently the DataWrapper prevents the sklearn cross-validation functions from turning the dataframe into a numpy array before it reaches _get_col_subset. Since the DataWrapper instance doesn't have a shape attribute, sklearn.cross_validation._safe_split returns a list of Series, one per row (example), to take part in CV. These Series are later grouped back into a dataframe inside the _get_col_subset method.

I'm not sure what the best way to deal with this is. Replacing the previous two lines with:

X = pd.DataFrame(X)

and leaving the cols slicing to the later code in the same function seems to provide a good speedup (around 3x), but I still have to write tests to ensure it doesn't break anything.
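To illustrate the diagnosis, here is a standalone sketch (not the sklearn-pandas internals themselves) of why rebuilding the frame once is so much cheaper than slicing every row Series:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 100))
# roughly what _safe_split hands back when the input has no shape attribute:
rows = [df.iloc[i] for i in range(len(df))]

# slow path: one Series __getitem__ per row, for every column lookup
col_slow = pd.Series([r[0] for r in rows])

# fast path: rebuild the frame once, then slice columns as usual
col_fast = pd.DataFrame(rows)[0]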

Perhaps we can get better speedups without the list trick, but I don't know how to do that while still preventing sklearn from turning the dataframe into a numpy array.

Ideas welcome! :)

@dukebody
Collaborator

dukebody commented Nov 8, 2015

Hm, I was testing with scikit-learn==0.15.2. It looks like this might already be solved in scikit-learn>=0.16.0, since it uses the indexable function to check the input instead of check_arrays.

See #26 (comment) and https://github.com/scikit-learn/scikit-learn/blob/0.16.0/sklearn/cross_validation.py#L1350.
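A quick sketch of the difference (just an illustration of indexable passing frames through unchanged; the assert is mine, not from the linked code):

import pandas as pd
from sklearn.utils import indexable

df = pd.DataFrame({'a': [1, 2, 3]})
X, = indexable(df)
assert isinstance(X, pd.DataFrame)  # no conversion to a numpy array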

@dukebody
Collaborator

dukebody commented Dec 8, 2015

Perhaps we should just note in the documentation that the custom CV wrappers are only needed for scikit-learn<0.16.0, and leave it at that. What do you think @zacstewart?

@zacstewart
Contributor

I think documenting that is a good idea, but maybe we should also make sklearn_pandas.GridSearchCV pass through to sklearn itself, depending on the version. Is something like this worth uglying up the code to make it future-friendly?

import sklearn
import sklearn.grid_search
import sklearn_pandas
from distutils.version import StrictVersion

if StrictVersion(sklearn.__version__) >= StrictVersion('0.16'):
  sklearn_pandas.GridSearchCV = sklearn.grid_search.GridSearchCV

@dukebody
Collaborator

I don't think it's worth uglying up the code that way. We can say that these wrappers are deprecated and will eventually be dropped in sklearn-pandas 2.0. I will, however, add a test making sure that sklearn.grid_search.GridSearchCV in scikit-learn>=0.16.0 works with a DataFrameMapper in a pipeline.
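A rough sketch of what such a test might look like (the column names and grid below are illustrative only, not the actual test):

import numpy as np
import pandas as pd
import sklearn.grid_search
import sklearn.pipeline
import sklearn_pandas
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def test_gridsearchcv_with_dataframemapper():
  df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
  y = np.random.randint(2, size=100)
  pipeline = sklearn.pipeline.Pipeline([
    ('mapper', sklearn_pandas.DataFrameMapper([
      ('a', StandardScaler()),
      ('b', StandardScaler()),
    ])),
    ('clf', SGDClassifier()),
  ])
  grid = sklearn.grid_search.GridSearchCV(pipeline, {'clf__penalty': ['l1', 'l2']})
  grid.fit(df, y)  # should simply not raise on scikit-learn>=0.16.0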

@dukebody
Collaborator

@zacstewart can you review #48 please? It's a really minor addition but I always like the four-eyes approach to changes. :-)

@Balandat

Along those lines: unfortunately, CalibratedClassifierCV, introduced in sklearn 0.16, does not seem to work with a DataFrameMapper in a pipeline (this is still the case in sklearn 0.17).
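For context, a sketch of the kind of usage being described, assuming a mapper pipeline like the ones above (the failure itself is as reported, not reproduced here):

from sklearn.calibration import CalibratedClassifierCV

# wrap the mapper + classifier pipeline; on the affected versions,
# fit() reportedly fails when the input is a dataframe
calibrated = CalibratedClassifierCV(pipeline, cv=3)
calibrated.fit(df[data], df[target])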

@dukebody
Collaborator

@Balandat Could you provide an example with a traceback (or wrong result)? Thanks.

dukebody added a commit that referenced this issue Jan 16, 2016
Deprecate custom CV shims in documentation and code. Refs #11.
@dukebody
Collaborator

@Balandat I'm closing this issue since it's already fixed. I've opened #53 to follow up on the issue you mention.

@dukebody dukebody closed this as completed Mar 6, 2016