Mapie can not use Pipelines to its full extent, throws exception #149

Closed
nilslacroix opened this issue Mar 28, 2022 · 5 comments

@nilslacroix

Describe the bug
I want to use mapie on a model, which I obtained from gscv.best_estimator_ . The model uses a pipeline, which looks like this:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(n_jobs=-1,
                                   transformers=[('enc_plz',
                                                  BinaryEncoder(drop_invariant=True),
                                                  ['Postcode']),
                                                 ('enc_obj',
                                                  OneHotEncoder(drop_invariant=True),
                                                  ['PropertyType']),
                                                 ('features', 'passthrough',
                                                  Index(['YearSurvey', 'GeoY', 'SecondBathroom', 'Income'], dtype='object')),
                                                 ('log',
                                                  FunctionTransformer(func=<ufunc 'log1p'>,
                                                                      validate...
                                                  Index(['Pensioner', 'Balcony', 'YearModernization', 'YearBuilt'], dtype='object'))])),
                ('scaler', RobustScaler()),
                ('clf',
                 ClfSwitcher(estimator=LGBMRegressor(bagging_fraction=0.4,
                                                     bagging_freq=4,
                                                     bagging_seed=15871193,
                                                     feature_fraction=0.11,
                                                     feature_fraction_seed=15871193,
                                                     learning_rate=0.007,
                                                     max_bin=63,
                                                     min_data_in_leaf=10,
                                                     n_estimators=4000,
                                                     num_leaves=6,
                                                     objective='regression',
                                                     random_state=15871193)))])

When I use the lines


from mapie.regression import MapieRegressor

mapie = MapieRegressor(gscv.best_estimator_, method="plus", cv=4)
mapie.fit(X_train, y_train)

Mapie throws the exception:

ValueError: could not convert string to float: 'EFH'

in

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
    457         cv = self._check_cv(self.cv)
    458         estimator = self._check_estimator(self.estimator)
--> 459         X, y = check_X_y(
    460             X, y, force_all_finite=False, dtype=["float64", "int", "object"]
    461         )

X_train and y_train are still in raw format (strings, unscaled values, ...); the pipeline was designed to address this.

My guess is that when mapie.fit is called on X_train, the categorical value "EFH" produces the error because it
is not of type float64, int or object. However, the pipeline would address this by using an encoder.

Expected behavior
I would expect the pipeline to preprocess my data before an exception is thrown because of a wrong datatype.
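
The expectation can be illustrated outside MAPIE. The following is a minimal sketch with a made-up two-column frame; the data, the `LinearRegression` stand-in for the LightGBM model, and all variable names are assumptions for illustration, not the setup from this issue. Converting the raw values to floats fails with the same kind of error as above, while the pipeline fits because its encoder runs before the regressor sees the data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column, one numeric column.
X_raw = pd.DataFrame(
    {
        "PropertyType": ["EFH", "MFH", "EFH", "ETW", "MFH", "ETW"],
        "Income": [30.0, 42.0, 35.0, 51.0, 44.0, 38.0],
    }
)
y = np.array([200.0, 310.0, 220.0, 400.0, 330.0, 290.0])

# Forcing the raw values to float64 fails on the first string it meets.
try:
    X_raw.to_numpy().astype("float64")
    raw_conversion_failed = False
except ValueError as exc:
    raw_conversion_failed = True
    print(f"raw conversion failed: {exc}")

# The pipeline encodes the string column before the regressor is fitted.
pipeline = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    ("enc_obj", OneHotEncoder(handle_unknown="ignore"), ["PropertyType"]),
                    ("features", "passthrough", ["Income"]),
                ]
            ),
        ),
        ("clf", LinearRegression()),  # stand-in for the real regressor
    ]
)
pipeline.fit(X_raw, y)
predictions = pipeline.predict(X_raw)
```

Any input validation that coerces the raw frame to a numeric dtype before the pipeline runs would hit the first branch, which matches the behaviour described in this report.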

@nilslacroix nilslacroix added the bug Something isn't working label Mar 28, 2022
@gmartinonQM
Contributor

Hi @nilslacroix, which version of MAPIE are you using?

@gmartinonQM gmartinonQM self-assigned this Mar 28, 2022
@nilslacroix
Author

> Hi @nilslacroix, which version of MAPIE are you using?

I am using MAPIE 0.3.1, conda 4.12, Python 3.9.6 and scikit-learn 0.24.2 on a Windows 10 machine.

@nilslacroix
Author

I deleted the whole part around line 458 that does the check, which lets the LGBM regressor start, but after that I get this KeyError:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11188/477517254.py in <module>
      4 
      5 mapie = MapieRegressor(akc[0][2], method="plus", cv=4)
----> 6 mapie.fit(X_train, y_train)
      7 #y_pred[name], y_pis[name] = mapie.predict(X_test, alpha=0.05)

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
    495                 self.n_samples_val_ = [X.shape[0]]
    496             else:
--> 497                 outputs = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
    498                     delayed(self._fit_and_predict_oof_model)(
    499                         clone(estimator),

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in _fit_and_predict_oof_model(self, estimator, X, y, train_index, val_index, k, sample_weight)
    366 
    367         """
--> 368         X_train, y_train, X_val = X[train_index], y[train_index], X[val_index]
    369         if sample_weight is None:
    370             estimator = fit_estimator(estimator, X_train, y_train)

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3459             if is_iterator(key):
   3460                 key = list(key)
-> 3461             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3462 
   3463         # take() does not accept boolean indexers

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1372                 if use_interval_msg:
   1373                     key = list(key)
-> 1374                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Int64Index([ 648,  649,  650,  651,  652,  653,  654,  655,  656,  657,\n            ...\n            2582, 2583, 2584, 2585, 2586, 2587, 2588, 2589, 2590, 2591],\n           dtype='int64', length=1944)] are in the [columns]"

Seems to be related to line 368 in regression.py
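
The `KeyError` is consistent with how pandas interprets `X[train_index]`: on a `DataFrame`, square brackets with a list-like of integers select columns by label, not rows by position. A small sketch with invented data (the frame and the index values are assumptions for illustration) shows the difference and two positional alternatives:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string column labels, like the raw X_train.
X = pd.DataFrame(
    {
        "PropertyType": ["EFH", "MFH", "ETW", "EFH"],
        "Income": [30.0, 42.0, 35.0, 51.0],
    }
)
train_index = np.array([0, 2, 3])

# Square brackets look the integers up among the column labels and fail.
try:
    X[train_index]
    bracket_lookup_failed = False
except KeyError as exc:
    bracket_lookup_failed = True
    print(f"KeyError: {exc}")

# Positional row selection works with .iloc on the DataFrame ...
X_train = X.iloc[train_index]

# ... or with plain indexing once the data is a NumPy array.
X_train_array = X.to_numpy()[train_index]
```

The same `X[train_index]` expression is valid for NumPy arrays, which is why code written for arrays can break as soon as a `DataFrame` is passed through unchanged.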

@gmartinonQM
Contributor

> > Hi @nilslacroix, which version of MAPIE are you using?
>
> I am using version 0.3.1 for mapie, conda 4.12, python 3.9.6 and scikit 0.24.2 on a windows 10 machine

OK, could you try the latest version of MAPIE, 0.3.2? We fixed a similar issue recently: #128
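
For anyone landing here with the same error, a quick way to see whether the installed MAPIE is at least 0.3.2 is sketched below. It uses only the standard library; the helper names `parse_release` and `is_at_least` are made up for this example, and the version parsing is deliberately simplistic (numeric release segments only).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '0.3.2' into (0, 3, 2); stops at the first non-numeric segment."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(package, minimum):
    """Return True/False for an installed package, or None if it is missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return None
    return parse_release(installed) >= parse_release(minimum)


print("mapie >= 0.3.2:", is_at_least("mapie", "0.3.2"))
```

If this prints `False`, upgrading the package in the active environment is the first thing to try.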

@nilslacroix
Author

The latest version works fine. Thank you very much for your work on this project :)
