
Out of fold stacking regressor #201

Merged

14 commits merged into rasbt:master from EikeDehling:out_of_fold_stacking_regressor on Jun 13, 2017

Conversation

@EikeDehling (Contributor) commented Jun 8, 2017

Description

I've implemented a new ensemble regressor for out-of-fold stacking. It's a different approach to training the base regressors that better avoids overfitting. For a description of the algorithm, see:

https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/#Stacking

I've only implemented the algorithm and some basic tests, but not written documentation yet. Right now I'd like to know whether you are interested in including this algorithm in the mlxtend code base before I spend more time on it.

If you're interested in including this, I'm happy to iterate on review/code!

Thanks for taking a look at this!

Pull Request requirements

  • Added appropriate unit test functions in the ./mlxtend/*/tests directories
  • Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
  • Checked the test coverage by running nosetests ./mlxtend --with-coverage
  • Checked for style issues by running flake8 ./mlxtend
  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Modify documentation in the appropriate location under mlxtend/docs/sources/ (optional)
  • Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend

@pep8speaks commented Jun 8, 2017

Hello @EikeDehling! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 13, 2017 at 02:40 UTC

@rasbt (Owner) commented Jun 8, 2017

Hi @EikeDehling,

thanks a lot for the PR, this sounds awesome, and I am really looking forward to contributions like these! I haven't read through the code in detail, but based on the description on your website, it looks like it is an approach similar to the StackingCVClassifier (http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/) but for regression?

In the StackingCVClassifier, the 1st-level models are fit to the training fold, and their predictions on the test fold are then used to fit the 2nd-level model. It sounds like the OutOfFoldStackingRegressor is doing the same thing for training? However, for the prediction, the StackingCVClassifier does the following: the 1st-level models make individual predictions, and the 2nd-level classifier uses these predictions to make its own prediction, which is then regarded as the output of the stacking classifier. If I understand correctly (based on the text description), the OutOfFoldStackingRegressor uses the same approach for the test set prediction?

However, based on the figure, it looks like it adds the average of the 1st-level models' predictions to the final prediction? Or, in other words, the OutOfFoldStackingRegressor test prediction is composed of two terms: the averaged 1st-level predictions plus the prediction of the 2nd-level model?

I am just wondering about the differences and similarities between the StackingCVClassifier and the OutOfFoldStackingRegressor, because if they are indeed similar, we should probably use the same name (e.g., StackingCVRegressor, and refer to it in the documentation as something like "StackingCVRegressor implements out-of-fold stacking for regression models"). What do you think?

Generally, I am all in favor of adding this implementation to mlxtend, thanks a lot for considering this contribution!

@coveralls: Coverage increased (+0.2%) to 93.637% when pulling b76f9ce on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

Hi @rasbt,

thanks for the quick response.

The algorithm indeed looks like the StackingCVClassifier, but adapted for regression. There seem to be some subtle differences, though.

I'm fine with your naming suggestion; my only remark is that there is no cross-validation going on in the algorithm, as far as I can see. I don't have a strong opinion on the naming, though; happy to go with your choice.

The algorithm divides the training data into K folds. It then trains N instances of each base model type, each on K-1 parts of the training data. Each instance makes predictions on the remaining part of the training data. The predictions for each model are then concatenated and used as input for the second-level model. This is all identical to the StackingCVClassifier.

Now the difference starts: the N instances of each base model are kept and used during prediction. The regressor makes N predictions, one with each instance of the base model, and averages them as input for the second-level model. The second-level model then predicts the final output.
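To make this concrete, here is a minimal Python sketch of that scheme (illustrative only; names like base_models, meta_model, and the helper functions are hypothetical, not the PR's actual code):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_oof_stacking(base_models, meta_model, X, y, n_folds=5):
    # Train n_folds instances of each base model; collect their
    # out-of-fold predictions as training input for the meta-model.
    kf = KFold(n_splits=n_folds)
    fitted = [[] for _ in base_models]
    meta_features = np.zeros((X.shape[0], len(base_models)))
    for train_idx, holdout_idx in kf.split(X):
        for j, model in enumerate(base_models):
            inst = clone(model).fit(X[train_idx], y[train_idx])
            meta_features[holdout_idx, j] = inst.predict(X[holdout_idx])
            fitted[j].append(inst)  # keep each per-fold instance for prediction
    meta_model.fit(meta_features, y)
    return fitted, meta_model

def predict_oof_stacking(fitted, meta_model, X):
    # One column per base model: the average of its per-fold predictions.
    meta_features = np.column_stack(
        [np.mean([inst.predict(X) for inst in insts], axis=0)
         for insts in fitted])
    return meta_model.predict(meta_features)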

I'm not a theoretical ML expert, so I'm not sure which approach works better. Perhaps the ideas from the StackingCVClassifier would even be an improvement? I would be happy to run some experiments and then code the best-performing version.

Best regards, Eike

@rasbt (Owner) commented Jun 9, 2017

Thanks for the thorough explanation, @EikeDehling!

I'm fine with your naming suggestion; my only remark is that there is no cross-validation going on in the algorithm, as far as I can see. I don't have a strong opinion on the naming, though; happy to go with your choice.

I agree with you; it's not really k-fold cross-validation going on here, just a k-fold-like sampling. I think we chose that name because we already had the StackingClassifier, and it is called "stacking classification with k-fold cross-validation" in

  • Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.

I also don't have a strong preference, but maybe we should stick with StackingCVRegressor for simplicity/analogy.

I'm not a theoretical ML expert, so I'm not sure which approach works better. Perhaps the ideas from the StackingCVClassifier would even be an improvement? I would be happy to run some experiments and then code the best-performing version.

Is this based on some paper in the literature, or is this a "new" algorithm you came up with during experimentation? If you'd run some experiments to compare both approaches, that would be great, but don't worry about it if it's too much work (though that could also be an interesting study for a paper if it hasn't been done before :)).

Now the difference starts: the N instances of each base model are kept and used during prediction. The regressor makes N predictions, one with each instance of the base model, and averages them as input for the second-level model. The second-level model then predicts the final output.

Say we have n=10 samples in the dataset and k=5 1st-level regressors. During prediction, the 1st-level regressors first produce n*k=50 predictions, which are then averaged over k so that the 2nd-level regressor gets n values instead of all 50 values? And the final prediction is the prediction of the 2nd-level regressor on those n values? I think the only difference to the StackingCVClassifier would be the averaging part. I.e., if you stacked the predictions instead of averaging them before giving them to the 2nd-level classifier, the algorithms would be identical except for classification vs. regression.
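As a toy illustration of the two variants (averaging vs. stacking the 1st-level outputs; the numbers mirror the example above and are otherwise arbitrary):

import numpy as np

n, k = 10, 5
preds = np.random.rand(k, n)   # one row of test-set predictions per 1st-level regressor

averaged = preds.mean(axis=0)  # shape (10,): n values after averaging over k
stacked = preds.T              # shape (10, 5): all n*k values, one column per regressor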

If that's indeed the case, I'd suggest toggling the averaging via a parameter. For example,

StackingCVRegressor(..., use_averaged_predictions_in_secondary)

or something like that. The default could then be the setting (True/False) that works better in practice.

PS: There's another parameter in the StackingCVClassifier that could be transferred to the StackingCVRegressor:

  • use_features_in_secondary : bool (default: False)

If True, the meta-classifier will be trained both on the predictions of the original classifiers and on the original dataset. If False, the meta-classifier will be trained only on the predictions of the original classifiers.

I am not sure whether that helps the performance of the StackingCVRegressor in practice, but it could be included as an additional option so that users can run their own experiments ...

(I just noticed that "predictions of the original classifiers" should probably be changed to "predictions of the level-1 classifiers" for clarity.)
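A minimal sketch of what that option amounts to (variable names hypothetical, not mlxtend's actual code):

import numpy as np

def build_meta_input(X, level1_predictions, use_features_in_secondary=False):
    # If requested, the meta-model sees the original features
    # alongside the level-1 predictions; otherwise only the predictions.
    if use_features_in_secondary:
        return np.hstack((X, level1_predictions))
    return level1_predictions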

@EikeDehling (Contributor, Author)

Hi!

The algorithm came from here; there is a good graphical explanation as well:

https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/#Stacking

I've renamed it to StackingCVRegressor now.

I will run some experiments on re-training the level-1 models vs. keeping multiple instances and averaging, and will let you know. I will also have a look at documenting this.

Thanks!

@coveralls: Coverage increased (+0.2%) to 93.637% when pulling 788f8bd on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

I've tried out what difference the two approaches make:

  1. Retrain one instance of each level-1 model on the full data and use that as input for the level-2 predictions.
  2. Train K instances of each level-1 model on parts of the data, keep them around, and average their predictions per model; the result is one column of data per level-1 model as input for the level-2 predictions.

Approach 2 is what the existing StackingCVClassifier does; approach 1 is what I saw documented elsewhere.

My results: https://www.kaggle.com/eikedehling/trying-out-stacking-approaches/code

Summary: it doesn't make any significant difference in the results. Sometimes one version is slightly better, sometimes the other does a bit better.

@rasbt (Owner) commented Jun 10, 2017

Thanks for testing these out! Just wanted to mention that in 2), the level-1 classifiers (in the StackingCVClassifier) are also fit on the whole training set in the end, as per
https://github.com/rasbt/mlxtend/blob/master/mlxtend/classifier/stacking_cv_classification.py#L205

If it's not too complicated, it would be nice to have a parameter to toggle between the two approaches, as mentioned above: StackingCVRegressor(..., use_averaged_predictions_in_secondary). If that makes the implementation too complicated, let's maybe not worry about it.

Either way, as long as it's documented what it does, it would be fine :)

Review comment on the diff:

# is trained and makes predictions, after which we train the
# meta-regressor on their combined results.
#
for i, clf in enumerate(self.regressors):

@rasbt (Owner):

Not that important since it's an internal variable, but I would suggest changing clf to regr or so

@rasbt (Owner) commented Jun 10, 2017

I just re-examined the code and made a high-level summary for myself:

StackingCVClassifier

  • create a clone of each level-1 classifier (say we have 5)
  • for each fold in the training set:
    • for each level-1 clone:
      • fit the level-1 clone to the training fold
      • predict labels in the test fold
      • add the predictions to a single_model_prediction array

    (if we have cv=3, that is 3 test folds, and each test fold has 50 samples, the resulting single_model_prediction array should have shape [150, 1] for each model)

  • stack the different single_model_prediction arrays into an all_pred_array
    (if we have 5 level-1 models, the resulting array now has shape [150, 5])

  • fit the level-2 classifier to the [150, 5] array

  • refit all level-1 classifiers to the original dataset, the [150, n_features] array

  • in prediction, for each level-1 clone:
    • predict labels in the test dataset
    • add the predictions to a single_model_prediction array
  • stack the different single_model_prediction arrays into an all_pred_array

  • make the final prediction on the [n_samples, n_level1_models] array via the 2nd-level classifier

Current StackingCVRegressor

  • for each regressor in the level-1 regressors (say we have 5):
    • for each training fold in the dataset (say we have 3):
      • make a copy of the regressor
      • fit the copy to the training fold
      • predict on the test fold

    (if each test fold has 50 samples and we have 3 test folds, we get an array of shape [150, 5], similar to the StackingCVClassifier)

  • fit the meta-regressor on the [150, 5] test-fold predictions

For the test set prediction, we use all n_level1_regressors x n_folds regressors, so we have 5x3=15 regressors for the test set. This is done like [[regr1, regr1, regr1], ..., [regr5, regr5, regr5]]. Then, for each regr_i, we compute the average prediction, which is passed to the meta-regressor.

Diff

  • The StackingCVClassifier's level-1 classifiers are refit to the whole training set after fitting the meta-classifier.
  • The StackingCVRegressor keeps multiple regressors per level-1 model, one created for each fold during training. In contrast, the StackingCVClassifier does not keep a level-1 classifier per fold; its level-1 classifiers are fit on the whole training set.

I think your implementation looks fine! I probably wouldn't add more complexity if the approach works well in practice. We just need to document the behavior properly :).
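For contrast with the averaging scheme sketched earlier in the thread, the StackingCVClassifier-style flow summarized above would look roughly like this (an illustrative sketch with hypothetical names, not the mlxtend source):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_stacking_cv(level1_models, level2_model, X, y, cv=3):
    # Collect out-of-fold predictions as training input for the level-2 model ...
    kf = KFold(n_splits=cv)
    oof = np.zeros((X.shape[0], len(level1_models)))
    for train_idx, test_idx in kf.split(X):
        for j, model in enumerate(level1_models):
            oof[test_idx, j] = (clone(model)
                                .fit(X[train_idx], y[train_idx])
                                .predict(X[test_idx]))
    level2_model.fit(oof, y)
    # ... then refit each level-1 model on the whole training set.
    for model in level1_models:
        model.fit(X, y)
    return level1_models, level2_model

def predict_stacking_cv(level1_models, level2_model, X):
    # Stack (not average) the level-1 predictions for the level-2 model.
    stacked = np.column_stack([m.predict(X) for m in level1_models])
    return level2_model.predict(stacked)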

@rasbt (Owner) commented Jun 10, 2017

Maybe you could prepare a Jupyter notebook similar to the one for the StackingCVClassifier, and I can then add a summarizing figure to your text, as was done for the StackingCVClassifier -- also to double-check whether that's okay with you.

@EikeDehling (Contributor, Author)

Hi @rasbt,

Thanks for the feedback and review!

I've adjusted the StackingCVRegressor to match the StackingCVClassifier's algorithm (training the level-1 regressors on the full dataset; getting rid of the N regressors and the averaging). The results were just as good, so maybe we should stick with that approach. I also implemented the use_features_in_secondary option.
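For reference, usage of the resulting class looks roughly like this (a sketch; the estimator choices and data are hypothetical, while the constructor arguments are the ones discussed in this thread):

from mlxtend.regressor import StackingCVRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

stack = StackingCVRegressor(
    regressors=(Lasso(), RandomForestRegressor(random_state=0)),
    meta_regressor=Ridge(),
    use_features_in_secondary=False)  # option added in this PR

stack.fit(X, y)
print(stack.predict(X[:5]))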

I've made a start on a notebook in the docs and will let you know when I'm done there.

Best, Eike

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling a29452b on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.


@EikeDehling (Contributor, Author)

I think the Jupyter notebook now covers the important things. The other notebooks also include API docs; maybe you can help me get that working. I assume that's automatically generated?

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling 530d7ee on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling 6ffb708 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@rasbt (Owner) commented Jun 12, 2017

Looks great so far! And I'd be happy to take care of the documentation details -- it's mostly automated, but I think it'd be easier if I apply the respective changes myself.

If you don't mind, could you enable the "allow maintainers to edit" option?

[Screenshot: the "Allow edits from maintainers" checkbox in the pull request sidebar]

@EikeDehling (Contributor, Author)

I think that option is enabled now - can you edit? Thanks!

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling f07a7c9 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@rasbt (Owner) commented Jun 13, 2017

Alright, I just set up the documentation so that it can be uploaded to the web documentation when I make the next mlxtend version release. I also made some small modifications to the Jupyter notebook. If that looks okay to you, I think this PR is ready to be merged :)

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling cba8366 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

Hi @rasbt, cool, looks great to me! Thanks for that :-) Eike

@rasbt merged commit e688f7d into rasbt:master on Jun 13, 2017
@rasbt (Owner) commented Jun 13, 2017

Thanks for all the work and great contribution, really appreciate it!

@EikeDehling (Contributor, Author)

Cool, great!

@rasbt mentioned this pull request on Jun 23, 2017