
Out of fold stacking regressor #201

Merged

14 commits merged into rasbt:master from EikeDehling:out_of_fold_stacking_regressor on Jun 13, 2017

Conversation

@EikeDehling (Contributor) commented Jun 8, 2017

Description

I've implemented a new ensemble regressor for out-of-fold stacking. It's a different approach to training the base regressors that better avoids overfitting. For a description of the algorithm, see:

https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/#Stacking

I've only implemented the algorithm and some basic tests, but not written documentation yet. Right now I'd like to know whether you are interested in including this algorithm in the mlxtend code base before I spend more time on it.

If you're interested in including this, I'm happy to iterate on review/code!

Thanks for taking a look at this!

Pull Request requirements

  • Added appropriate unit test functions in the ./mlxtend/*/tests directories
  • Ran nosetests ./mlxtend -sv and made sure that all unit tests pass
  • Checked the test coverage by running nosetests ./mlxtend --with-coverage
  • Checked for style issues by running flake8 ./mlxtend
  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Modify documentation in the appropriate location under mlxtend/docs/sources/ (optional)
  • Checked that the Travis-CI build passed at https://travis-ci.org/rasbt/mlxtend

@pep8speaks commented Jun 8, 2017

Hello @EikeDehling! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 13, 2017 at 02:40 UTC

@rasbt (Owner) commented Jun 8, 2017

Hi @EikeDehling,

thanks a lot for the PR, this sounds awesome, and I am really looking forward to contributions like these! I haven't read through the code in detail, but based on the description on your website, it looks like it is an approach similar to the StackingCVClassifier (http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/) but for regression?

In the StackingCVClassifier, the 1st-level models are fit to the training fold, and their predictions on the test fold are then used to fit the 2nd-level model. It sounds like the OutOfFoldStackingRegressor is doing the same thing for training? However, for the prediction, the StackingCVClassifier does the following: the 1st-level models make individual predictions, and the 2nd-level classifier uses these predictions to make its own prediction, which is then regarded as the output of the stacking classifier. If I understand correctly (based on the text description), the OutOfFoldStackingRegressor uses the same approach for the test set prediction?

However, based on the figure, it looks like it adds the average of the 1st-level models' predictions to the final prediction? Or, in other words, the OutOfFoldStackingRegressor test prediction is composed of two terms: the averaged 1st-level predictions plus the prediction of the 2nd-level model?

I am just wondering about the differences and similarities between the StackingCVClassifier and the OutOfFoldStackingRegressor, because if they are indeed similar, we should probably use the same name (e.g., StackingCVRegressor, and refer to it in the documentation as something like "StackingCVRegressor implements out-of-fold stacking for regression models"). What do you think?

Generally, I am all in favor of adding this implementation to mlxtend, thanks a lot for considering this contribution!

@coveralls: Coverage increased (+0.2%) to 93.637% when pulling b76f9ce on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

Hi @rasbt,

thanks for the quick response.

The algorithm indeed looks like the StackingCVClassifier, but adapted for regression. There seem to be some subtle differences, though.

I'm fine with your naming suggestion; my only remark is that there is no cross-validation going on in the algorithm, as far as I can see. I don't have a strong opinion on the naming, though; happy to go with your choice.

The algorithm divides the training data into K folds. It then trains N instances of each base model type, each on K-1 parts of the training data. Each instance makes predictions on the remaining part of the training data. The predictions for each model are then concatenated and used as input for the second-level model. This is all identical to the StackingCVClassifier.

Now the difference starts: the N instances of each base model are kept and used during prediction. The regressor makes N predictions, one with each instance of the base model, and averages them as input for the second-level model. The second-level model then predicts the final output.
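To make this concrete, here is a minimal Python sketch of that scheme (illustrative only; names like base_models, meta_model, and the helper functions are hypothetical, not the PR's actual code):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_oof_stacking(base_models, meta_model, X, y, n_folds=5):
    # Train n_folds instances of each base model; collect their
    # out-of-fold predictions as training input for the meta-model.
    kf = KFold(n_splits=n_folds)
    fitted = [[] for _ in base_models]
    meta_features = np.zeros((X.shape[0], len(base_models)))
    for train_idx, holdout_idx in kf.split(X):
        for j, model in enumerate(base_models):
            inst = clone(model).fit(X[train_idx], y[train_idx])
            meta_features[holdout_idx, j] = inst.predict(X[holdout_idx])
            fitted[j].append(inst)  # keep each per-fold instance for prediction
    meta_model.fit(meta_features, y)
    return fitted, meta_model

def predict_oof_stacking(fitted, meta_model, X):
    # One column per base model: the average of its per-fold predictions.
    meta_features = np.column_stack(
        [np.mean([inst.predict(X) for inst in insts], axis=0)
         for insts in fitted])
    return meta_model.predict(meta_features)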

I'm not a theoretical ML expert, so I'm not sure which approach works better. Perhaps the ideas from the StackingCVClassifier would even be an improvement? I would be happy to run some experiments and then code the best-performing version.

Best regards, Eike

@rasbt (Owner) commented Jun 9, 2017

Thanks for the thorough explanation, @EikeDehling!

I'm fine with your naming suggestion; my only remark is that there is no cross-validation going on in the algorithm, as far as I can see. I don't have a strong opinion on the naming, though; happy to go with your choice.

I agree with you; it's not really k-fold cross-validation going on here, just a k-fold-like sampling. I think we chose that name because we already had the StackingClassifier, and it is called "stacking classification with k-fold cross-validation" in

  • Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.

I also don't have a strong preference, but maybe we should stick with StackingCVRegressor for simplicity/analogy.

I'm not a theoretical ML expert, so I'm not sure which approach works better. Perhaps the ideas from the StackingCVClassifier would even be an improvement? I would be happy to run some experiments and then code the best-performing version.

Is this based on some paper in the literature, or is this a "new" algorithm you came up with during experimentation? If you'd run some experiments to compare both approaches, that would be great, but don't worry about it if it's too much work (though that could also be an interesting study for a paper if it hasn't been done before :)).

Now the difference starts: the N instances of each base model are kept and used during prediction. The regressor makes N predictions, one with each instance of the base model, and averages them as input for the second-level model. The second-level model then predicts the final output.

Say we have n=10 samples in the dataset and k=5 1st-level regressors. During prediction, the 1st-level regressors first produce n*k=50 predictions, which are then averaged over k so that the 2nd-level regressor gets n values instead of all 50 values? And the final prediction is the prediction of the 2nd-level regressor on those n values? I think the only difference to the StackingCVClassifier would be the averaging part. I.e., if you stacked the predictions instead of averaging them before giving them to the 2nd-level classifier, the algorithms would be identical except for classification vs. regression.
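As a toy illustration of the two variants (averaging vs. stacking the 1st-level outputs; the numbers mirror the example above and are otherwise arbitrary):

import numpy as np

n, k = 10, 5
preds = np.random.rand(k, n)   # one row of test-set predictions per 1st-level regressor

averaged = preds.mean(axis=0)  # shape (10,): n values after averaging over k
stacked = preds.T              # shape (10, 5): all n*k values, one column per regressor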

If that's indeed the case, I'd suggest toggling the averaging via a parameter. For example,

StackingCVRegressor(..., use_averaged_predictions_in_secondary)

or something like that. The default could then be the setting (True/False) that works better in practice.

PS: There's another parameter in the StackingCVClassifier that could be transferred to the StackingCVRegressor:

  • use_features_in_secondary : bool (default: False)

If True, the meta-classifier will be trained both on the predictions of the original classifiers and on the original dataset. If False, the meta-classifier will be trained only on the predictions of the original classifiers.

I am not sure whether that helps the performance of the StackingCVRegressor in practice, but it could be included as an additional option so that users can run their own experiments ...

(I just noticed that "predictions of the original classifiers" should probably be changed to "predictions of the level-1 classifiers" for clarity.)
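A minimal sketch of what that option amounts to (variable names hypothetical, not mlxtend's actual code):

import numpy as np

def build_meta_input(X, level1_predictions, use_features_in_secondary=False):
    # If requested, the meta-model sees the original features
    # alongside the level-1 predictions; otherwise only the predictions.
    if use_features_in_secondary:
        return np.hstack((X, level1_predictions))
    return level1_predictions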

@EikeDehling (Contributor, Author)

Hi!

The algorithm came from here; there is a good graphical explanation as well:

https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/#Stacking

I've renamed it to StackingCVRegressor now.

I will run some experiments on re-training the level-1 models vs. keeping multiple instances and averaging, and will let you know. I will also have a look at documenting this.

Thanks!

@coveralls: Coverage increased (+0.2%) to 93.637% when pulling 788f8bd on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

I've tried out what difference the two approaches make:

  1. Retrain one instance of each level-1 model on the full data and use that as input for the level-2 predictions.
  2. Train K instances of each level-1 model on parts of the data, keep them around, and average their predictions per model; the result is one column of data per level-1 model as input for the level-2 predictions.

Approach 2 is what the existing StackingCVClassifier does; approach 1 is what I saw documented elsewhere.

My results: https://www.kaggle.com/eikedehling/trying-out-stacking-approaches/code

Summary: it doesn't make any significant difference in the results. Sometimes one version is slightly better, sometimes the other does a bit better.

@rasbt (Owner) commented Jun 10, 2017

Thanks for testing these out! Just wanted to mention that in 2), the level-1 classifiers (in the StackingCVClassifier) are also fit on the whole training set in the end, as per
https://github.com/rasbt/mlxtend/blob/master/mlxtend/classifier/stacking_cv_classification.py#L205

If it's not too complicated, it would be nice to have a parameter to toggle between the two approaches, as mentioned above: StackingCVRegressor(..., use_averaged_predictions_in_secondary). If that makes the implementation too complicated, let's maybe not worry about it.

Either way, as long as it's documented what it does, it would be fine :)

Review comment on the diff:

# is trained and makes predictions, after which we train the
# meta-regressor on their combined results.
#
for i, clf in enumerate(self.regressors):

@rasbt (Owner):

Not that important since it's an internal variable, but I would suggest changing clf to regr or so

@rasbt (Owner) commented Jun 10, 2017

I just re-examined the code and made a high-level summary for myself:

StackingCVClassifier

  • create a clone of each level-1 classifier (say we have 5)
  • for each fold in the training set:
    • for each level-1 clone:
      • fit the level-1 clone to the training fold
      • predict labels in the test fold
      • add the predictions to a single_model_prediction array

    (if we have cv=3, that is 3 test folds, and each test fold has 50 samples, the resulting single_model_prediction array should have shape [150, 1] for each model)

  • stack the different single_model_prediction arrays into an all_pred_array
    (if we have 5 level-1 models, the resulting array now has shape [150, 5])

  • fit the level-2 classifier to the [150, 5] array

  • refit all level-1 classifiers to the original dataset, the [150, n_features] array

  • in prediction, for each level-1 clone:
    • predict labels in the test dataset
    • add the predictions to a single_model_prediction array
  • stack the different single_model_prediction arrays into an all_pred_array

  • make the final prediction on the [n_samples, n_level1_models] array via the 2nd-level classifier

Current StackingCVRegressor

  • for each regressor in the level-1 regressors (say we have 5):
    • for each training fold in the dataset (say we have 3):
      • make a copy of the regressor
      • fit the copy to the training fold
      • predict on the test fold

    (if each test fold has 50 samples and we have 3 test folds, we get an array of shape [150, 5], similar to the StackingCVClassifier)

  • fit the meta-regressor on the [150, 5] test-fold predictions

For the test set prediction, we use all n_level1_regressors x n_folds regressors, so we have 5x3=15 regressors for the test set. This is done like [[regr1, regr1, regr1], ..., [regr5, regr5, regr5]]. Then, for each regr_i, we compute the average prediction, which is passed to the meta-regressor.

Diff

  • The StackingCVClassifier's level-1 classifiers are refit to the whole training set after fitting the meta-classifier.
  • The StackingCVRegressor keeps multiple regressors per level-1 model, one created for each fold during training. In contrast, the StackingCVClassifier does not keep a level-1 classifier per fold; its level-1 classifiers are fit on the whole training set.

I think your implementation looks fine! I probably wouldn't add more complexity if the approach works well in practice. We just need to document the behavior properly :).
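For contrast with the averaging scheme sketched earlier in the thread, the StackingCVClassifier-style flow summarized above would look roughly like this (an illustrative sketch with hypothetical names, not the mlxtend source):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_stacking_cv(level1_models, level2_model, X, y, cv=3):
    # Collect out-of-fold predictions as training input for the level-2 model ...
    kf = KFold(n_splits=cv)
    oof = np.zeros((X.shape[0], len(level1_models)))
    for train_idx, test_idx in kf.split(X):
        for j, model in enumerate(level1_models):
            oof[test_idx, j] = (clone(model)
                                .fit(X[train_idx], y[train_idx])
                                .predict(X[test_idx]))
    level2_model.fit(oof, y)
    # ... then refit each level-1 model on the whole training set.
    for model in level1_models:
        model.fit(X, y)
    return level1_models, level2_model

def predict_stacking_cv(level1_models, level2_model, X):
    # Stack (not average) the level-1 predictions for the level-2 model.
    stacked = np.column_stack([m.predict(X) for m in level1_models])
    return level2_model.predict(stacked)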

@rasbt (Owner) commented Jun 10, 2017

Maybe you could prepare a Jupyter notebook similar to the one for the StackingCVClassifier, and I can then add a summarizing figure to your text, as was done for the StackingCVClassifier -- also to double-check whether that's okay with you.

@EikeDehling (Contributor, Author)

Hi @rasbt,

Thanks for the feedback and review!

I've adjusted the StackingCVRegressor to match the StackingCVClassifier's algorithm (training the level-1 regressors on the full dataset; getting rid of the N regressors and the averaging). The results were just as good, so maybe we should stick with that approach. I also implemented the use_features_in_secondary option.
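For reference, usage of the resulting class looks roughly like this (a sketch; the estimator choices and data are hypothetical, while the constructor arguments are the ones discussed in this thread):

from mlxtend.regressor import StackingCVRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

stack = StackingCVRegressor(
    regressors=(Lasso(), RandomForestRegressor(random_state=0)),
    meta_regressor=Ridge(),
    use_features_in_secondary=False)  # option added in this PR

stack.fit(X, y)
print(stack.predict(X[:5]))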

I've made a start on a notebook in the docs and will let you know when I'm done there.

Best, Eike

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling a29452b on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.


@EikeDehling (Contributor, Author)

I think the Jupyter notebook now covers the important things. The other notebooks also include API docs; maybe you can help me get that working. I assume that's automatically generated?

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling 530d7ee on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling 6ffb708 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@rasbt (Owner) commented Jun 12, 2017

Looks great so far! And I'd be happy to take care of the documentation details -- it's mostly automated, but I think it'd be easier if I apply the respective changes myself.

If you don't mind, could you enable the "allow maintainers to edit" option?

[Screenshot: the "Allow edits from maintainers" checkbox in the pull request sidebar]

@EikeDehling (Contributor, Author)

I think that option is enabled now - can you edit? Thanks!

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling f07a7c9 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@rasbt (Owner) commented Jun 13, 2017

Alright, I just set up the documentation so that it can be uploaded to the web documentation when I make the next mlxtend version release. I also made some small modifications to the Jupyter notebook. If that looks okay to you, I think this PR is ready to be merged :)

@coveralls: Coverage increased (+0.2%) to 93.66% when pulling cba8366 on EikeDehling:out_of_fold_stacking_regressor into d6426f3 on rasbt:master.

@EikeDehling (Contributor, Author)

Hi @rasbt, cool, looks great to me! Thanks for that :-) Eike

@rasbt merged commit e688f7d into rasbt:master on Jun 13, 2017
@rasbt (Owner) commented Jun 13, 2017

Thanks for all the work and great contribution, really appreciate it!

@EikeDehling (Contributor, Author)

Cool, great!

@rasbt mentioned this pull request on Jun 23, 2017