Issue with StackingCVClassifier train_meta_features_? #366

AllardJM · 2018-04-16T14:49:54Z

Using version 0.11.0

There seem to be a couple of issues with this attribute, which would certainly be useful.

1) sclf.train_meta_features_.shape is actually (number of training rows, number of classifiers * 2), because the predictions for both classes (in a binary problem) are kept.
2) I doubt that the index order is maintained when setting stratify=True or shuffle=True.

Here is an example where the resulting meta features do not appear to be in the same order as the original y:
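
To illustrate the first point, here is a minimal sketch of why the column count grows with the number of classes when use_probas=True. It does not call mlxtend: fake_predict_proba is a hypothetical stand-in for a fitted classifier's predict_proba output, and the horizontal stacking is an assumption based on the shape reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes, n_classifiers = 6, 2, 3


def fake_predict_proba(n_rows, n_cols):
    # Hypothetical stand-in for clf.predict_proba(X): rows sum to 1.
    raw = rng.random((n_rows, n_cols))
    return raw / raw.sum(axis=1, keepdims=True)


# One (n_samples, n_classes) block per first-level classifier.
per_classifier = [fake_predict_proba(n_samples, n_classes)
                  for _ in range(n_classifiers)]

# Stacking the blocks side by side gives n_classifiers * n_classes columns.
meta_features = np.hstack(per_classifier)
print(meta_features.shape)

# Under this layout, the positive-class column of classifier k
# sits at index k * n_classes + 1.
k = 0
positive_col = k * n_classes + 1
```

With a single classifier and two classes, this layout puts the positive-class probability in column 1, which is the column the examples in this report pass to roc_auc_score.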

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


import numpy as np
import pandas as pd

import mlxtend
from mlxtend.classifier import StackingCVClassifier


mlxtend.__version__

X, y = load_breast_cancer(return_X_y=True)

clf = RandomForestClassifier(random_state=1, n_estimators=100)
lr = LogisticRegression()

RANDOM_SEED = 3245
np.random.seed(RANDOM_SEED)
sclf = StackingCVClassifier(classifiers=[clf],
                            use_probas=True,
                            meta_classifier=lr,
                            store_train_meta_features=True,
                            refit=True,
                            cv=5,
                            verbose=1,
                            stratify=True,
                            shuffle=True)

sclf.fit(X, y)

roc_auc_score(y, sclf.train_meta_features_[:, 1])
0.45944981766291426

preds = cross_val_predict(clf, X, y, cv=5, method='predict_proba')
roc_auc_score(y, preds[:, 1])
0.9895354368162359

Here is an example where the order is apparently correct

sclf = StackingCVClassifier(classifiers=[clf],
                            use_probas=True,
                            meta_classifier=lr,
                            store_train_meta_features=True,
                            refit=True,
                            cv=5,
                            verbose=1,
                            stratify=False,
                            shuffle=False)

sclf.fit(X, y)

roc_auc_score(y, sclf.train_meta_features_[:, 1])
0.9888351567041911
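
The diagnostic used in both examples works because an AUC near 0.5 is what you get when informative scores are paired with the wrong rows. Below is a minimal, synthetic sketch of that effect using only NumPy. The auc helper is a hypothetical rank-based implementation written for this illustration (it assumes no tied scores), and the data are made up rather than taken from the breast cancer set.

```python
import numpy as np


def auc(y_true, scores):
    # Rank-based (Mann-Whitney U) AUC; assumes there are no tied scores.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Synthetic scores that are informative about y, in the same row order as y.
scores = y + rng.normal(scale=0.5, size=n)

# The same scores, but in a shuffled row order (as if stored in CV order).
perm = rng.permutation(n)

auc_aligned = auc(y, scores)
auc_misaligned = auc(y, scores[perm])
print(auc_aligned, auc_misaligned)
```

The aligned AUC should be high and the misaligned one close to chance, which mirrors the gap between the two results reported above.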


rasbt · 2018-04-16T15:35:59Z

> sclf.train_meta_features_.shape is actually (number of training rows, number of classifiers *2) because both classes predictions (in the case of a binary problem) are maintained.

Thanks for pointing that out. The shape given in the docs was probably a copy-and-paste error from when they were ported from the StackingClassifier docs (I have to double-check). I think the documented shape is correct when self.use_probas=False, but it should be corrected for the self.use_probas=True case.

> I have doubts that the index order is maintained when setting stratify =True or shuffle = True

I think you are right. I just inspected the code: the meta features get saved first, and only afterwards are the labels reordered:

        if self.store_train_meta_features:
            self.train_meta_features_ = all_model_predictions

        # We have to shuffle the labels in the same order as we generated
        # predictions during CV (we kinda shuffled them when we did
        # Stratified CV).
        # We also do the same with the features (we will need this only IF
        # use_features_in_secondary is True)
        reordered_labels = np.array([]).astype(y.dtype)
        reordered_features = np.array([]).reshape((0, X.shape[1]))\
            .astype(X.dtype)
        for train_index, test_index in skf:
            reordered_labels = np.concatenate((reordered_labels,
                                               y[test_index]))
            reordered_features = np.concatenate((reordered_features,
                                                 X[test_index]))

Instead of reordering the labels, it might be better to reorder the meta features (i.e., all_model_predictions). I think that should solve the problem.
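
As a rough illustration of that idea (not necessarily how the eventual fix is implemented), the sketch below puts predictions collected fold by fold back into the original row order. The fold construction, the variable names, and the toy "predictions" are all hypothetical; only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds = 10, 5

# Hypothetical shuffled CV: each fold is an array of held-out row indices.
test_folds = np.array_split(rng.permutation(n_samples), n_folds)

# Toy out-of-fold "prediction" for original row i is i * 0.1,
# so it is easy to see whether a row ended up in the right place.
true_per_row = np.arange(n_samples) * 0.1

# Predictions as they come out of the CV loop: concatenated fold by fold.
preds_in_cv_order = np.concatenate(
    [true_per_row[idx] for idx in test_folds]).reshape(-1, 1)

# Row j of the CV-ordered output belongs to original row cv_order[j].
cv_order = np.concatenate(test_folds)

# Scatter each prediction back to its original row position.
restored = np.empty_like(preds_in_cv_order)
restored[cv_order] = preds_in_cv_order

# Equivalent gather-style formulation using the inverse permutation.
restored_alt = preds_in_cv_order[np.argsort(cv_order)]
```

Storing the restored array lets train_meta_features_ be compared directly against the original y, while the meta-classifier can still be trained on any consistently ordered pair of features and labels.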

rasbt · 2018-04-20T00:31:39Z

I just merged a fix; the meta features should now be saved in the order of the original labels. Note that I also renamed the refit parameter to use_clones, based on the other discussion. The changes are currently only in master, but I plan to make a new release soon.
To install the current master branch version, run:

pip install git+https://github.com/rasbt/mlxtend.git

Anyways, thanks a lot for pointing these issues out!

rasbt added the Enhancement label Apr 17, 2018

rasbt mentioned this issue Apr 19, 2018: Fixes meta feature reshuffling bug in StackingCVClassifier #370 (Merged)

rasbt closed this as completed in #370 Apr 20, 2018
