Mapie can not use Pipelines to its full extent, throws exception #149

nilslacroix · 2022-03-28T17:58:26Z

Describe the bug
I want to use MAPIE on a model that I obtained from gscv.best_estimator_. The model is a pipeline that looks like this:

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(n_jobs=-1,
                                   transformers=[('enc_plz',
                                                  BinaryEncoder(drop_invariant=True),
                                                  ['Postcode']),
                                                 ('enc_obj',
                                                  OneHotEncoder(drop_invariant=True),
                                                  ['PropertyType']),
                                                 ('features', 'passthrough',
                                                  Index(['YearSurvey', 'GeoY', 'SecondBathroom', 'Income'], dtype='object')),
                                                 ('log',
                                                  FunctionTransformer(func=<ufunc 'log1p'>,
                                                                      validate...
                                                  Index(['Pensioner', 'Balcony', 'YearModernization', 'YearBuilt'], dtype='object'))])),
                ('scaler', RobustScaler()),
                ('clf',
                 ClfSwitcher(estimator=LGBMRegressor(bagging_fraction=0.4,
                                                     bagging_freq=4,
                                                     bagging_seed=15871193,
                                                     feature_fraction=0.11,
                                                     feature_fraction_seed=15871193,
                                                     learning_rate=0.007,
                                                     max_bin=63,
                                                     min_data_in_leaf=10,
                                                     n_estimators=4000,
                                                     num_leaves=6,
                                                     objective='regression',
                                                     random_state=15871193)))])

When I use the lines


mapie = MapieRegressor(best_estimator_, method="plus", cv=4)
mapie.fit(X_train, y_train)

Mapie throws the exception:

ValueError: could not convert string to float: 'EFH'

in

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
    457         cv = self._check_cv(self.cv)
    458         estimator = self._check_estimator(self.estimator)
--> 459         X, y = check_X_y(
    460             X, y, force_all_finite=False, dtype=["float64", "int", "object"]
    461         )

X_train and y_train are still in raw format (strings, not scaled, ...); the pipeline was designed to address this.

My guess is that when mapie.fit is called on X_train, the categorical value "EFH" produces the error because it
is not of type float64, int or object. However, the pipeline would address this by using an encoder.

Expected behavior
I would expect the pipeline to preprocess my data before an exception is thrown because of a wrong datatype.
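
To make the expectation concrete, here is a minimal sketch with a toy dataset. It assumes scikit-learn's own OneHotEncoder and a LinearRegression in place of the encoders and LGBMRegressor from the report, and the category values "MFH" and "ETW" are made up for illustration. The pipeline accepts the raw strings because its first step encodes them, while forcing the raw frame to float64 before the pipeline runs fails in the same way as the reported error. The np.asarray call only mimics an up-front float coercion; it is not MAPIE's actual validation code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy raw data: one string column and one numeric column (values are hypothetical).
X_raw = pd.DataFrame({
    "PropertyType": ["EFH", "MFH", "EFH", "ETW", "MFH", "ETW"],
    "Income": [30.0, 45.0, 52.0, 38.0, 61.0, 47.0],
})
y = np.array([200.0, 310.0, 260.0, 150.0, 400.0, 180.0])

# The encoder runs inside the pipeline, so the regressor never sees strings.
pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("enc_obj", OneHotEncoder(handle_unknown="ignore"), ["PropertyType"]),
        ],
        remainder="passthrough",
    )),
    ("reg", LinearRegression()),
])
pipe.fit(X_raw, y)
preds = pipe.predict(X_raw)

# Coercing the raw frame to float *before* the pipeline runs cannot work,
# because the string column has no float representation.
try:
    np.asarray(X_raw, dtype="float64")
    coercion_failed = False
except ValueError:
    coercion_failed = True

print(preds.shape, coercion_failed)
```

The point of the sketch is only that any wrapper around such a pipeline has to hand it the raw, unconverted data.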


gmartinonQM · 2022-03-28T18:53:52Z

Hi @nilslacroix, which version of MAPIE are you using?

nilslacroix · 2022-03-28T20:44:55Z

Hi @nilslacroix, which version of MAPIE are you using?

I am using MAPIE 0.3.1, conda 4.12, Python 3.9.6 and scikit-learn 0.24.2 on a Windows 10 machine.

nilslacroix · 2022-03-28T20:48:27Z

I deleted the whole part at line 458 where the check is done, which lets the LGBM regressor start, but after that I get this KeyError:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11188/477517254.py in <module>
      4 
      5 mapie = MapieRegressor(akc[0][2], method="plus", cv=4)
----> 6 mapie.fit(X_train, y_train)
      7 #y_pred[name], y_pis[name] = mapie.predict(X_test, alpha=0.05)

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in fit(self, X, y, sample_weight)
    495                 self.n_samples_val_ = [X.shape[0]]
    496             else:
--> 497                 outputs = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
    498                     delayed(self._fit_and_predict_oof_model)(
    499                         clone(estimator),

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~\miniconda3\envs\Master_ML\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~\miniconda3\envs\Master_ML\lib\site-packages\mapie\regression.py in _fit_and_predict_oof_model(self, estimator, X, y, train_index, val_index, k, sample_weight)
    366 
    367         """
--> 368         X_train, y_train, X_val = X[train_index], y[train_index], X[val_index]
    369         if sample_weight is None:
    370             estimator = fit_estimator(estimator, X_train, y_train)

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3459             if is_iterator(key):
   3460                 key = list(key)
-> 3461             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3462 
   3463         # take() does not accept boolean indexers

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~\miniconda3\envs\Master_ML\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1372                 if use_interval_msg:
   1373                     key = list(key)
-> 1374                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Int64Index([ 648,  649,  650,  651,  652,  653,  654,  655,  656,  657,\n            ...\n            2582, 2583, 2584, 2585, 2586, 2587, 2588, 2589, 2590, 2591],\n           dtype='int64', length=1944)] are in the [columns]"

Seems to be related to line 368 in regression.py
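
For readers hitting the same traceback: the KeyError above comes from indexing a pandas DataFrame with an integer array via X[...], which pandas interprets as a column selection rather than a row selection. The sketch below reproduces that on a toy frame and shows two row-wise alternatives. It is only an illustration of the pandas behaviour, with made-up data, and does not claim to be how MAPIE resolved it.

```python
import numpy as np
import pandas as pd

# Toy frame with string column labels, like the training data in the report.
X = pd.DataFrame({"GeoY": [1.0, 2.0, 3.0, 4.0], "Income": [10, 20, 30, 40]})
val_index = np.array([1, 3])  # row positions, e.g. from a CV splitter

# DataFrame[...] with a list-like key looks the values up among the COLUMNS,
# so integer row positions are not found and pandas raises KeyError.
try:
    X[val_index]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Positional row selection works with .iloc ...
X_val = X.iloc[val_index]

# ... or after converting to a NumPy array, where [...] indexes rows.
X_val_np = X.to_numpy()[val_index]

print(raised_key_error, X_val["Income"].tolist(), X_val_np.shape)
```

So removing the input check lets the DataFrame reach code written for array-style row indexing, which is why the error moves to line 368 instead of disappearing.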

gmartinonQM · 2022-03-29T07:10:04Z

Hi @nilslacroix, which version of MAPIE are you using?

I am using MAPIE 0.3.1, conda 4.12, Python 3.9.6 and scikit-learn 0.24.2 on a Windows 10 machine.

OK, could you try with the latest version of MAPIE, 0.3.2? We fixed a similar issue recently: #128

nilslacroix · 2022-04-11T13:03:38Z

The latest version works fine. Thank you very much for your work on this project :)

nilslacroix added the "bug" (Something isn't working) label Mar 28, 2022

gmartinonQM self-assigned this Mar 28, 2022

gmartinonQM closed this as completed Apr 11, 2022
