Issue with StackingCVClassifier train_meta_features_? #366

Closed
AllardJM opened this issue Apr 16, 2018 · 2 comments

AllardJM commented Apr 16, 2018

I'm using mlxtend version 0.11.0.

There seem to be a couple of issues with the train_meta_features_ attribute, which would otherwise be very useful.

1) sclf.train_meta_features_.shape is actually (number of training rows, number of classifiers * 2), because the predictions for both classes (in the case of a binary problem) are kept.
2) I doubt that the index order is maintained when setting stratify=True or shuffle=True.
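
To illustrate the first point, here is a minimal NumPy sketch of why the column count doubles in the binary case. It does not call mlxtend: the `fake_predict_proba` helper is hypothetical and only simulates the shape of a first-level classifier's `predict_proba` output, under the assumption that the meta-features are built by placing each classifier's class probabilities side by side.

```python
import numpy as np


def fake_predict_proba(n_samples, n_classes, rng):
    """Hypothetical stand-in for predict_proba: random rows that sum to 1."""
    raw = rng.random((n_samples, n_classes))
    return raw / raw.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_samples, n_classes, n_classifiers = 569, 2, 3

# One (n_samples, n_classes) probability block per first-level classifier.
per_classifier = [fake_predict_proba(n_samples, n_classes, rng)
                  for _ in range(n_classifiers)]

# Stacking the blocks horizontally gives n_classifiers * n_classes columns.
meta_features = np.hstack(per_classifier)
print(meta_features.shape)  # (569, 6)
```

With a single classifier and two classes this gives two columns, which is why the examples in this thread index column 1 to get the positive-class probability.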

Here is an example where the resulting meta-features do not appear to be in the same order as the original y:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


import numpy as np
import pandas as pd

import mlxtend
from mlxtend.classifier import StackingCVClassifier


mlxtend.__version__

X, y = load_breast_cancer(return_X_y=True)

clf = RandomForestClassifier(random_state=1, n_estimators=100)
lr = LogisticRegression()

RANDOM_SEED = 3245
np.random.seed(RANDOM_SEED)
sclf = StackingCVClassifier(classifiers=[clf],
                            use_probas=True,
                            meta_classifier=lr,
                            store_train_meta_features=True,
                            refit=True,
                            cv=5,
                            verbose=1,
                            stratify=True,
                            shuffle=True)



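sclf.fit(X, y)

# AUC of the stored meta-features (class-1 probability) against y
roc_auc_score(y, sclf.train_meta_features_[:, 1])
# 0.45944981766291426

# AUC of plain out-of-fold predictions from the same base classifier
preds = cross_val_predict(clf, X, y, cv=5, method='predict_proba')
roc_auc_score(y, preds[:, 1])
# 0.9895354368162359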

Here is an example where the order appears to be correct (stratify=False, shuffle=False):

sclf = StackingCVClassifier(classifiers=[clf],
                            use_probas=True,
                            meta_classifier=lr,
                            store_train_meta_features=True,
                            refit=True,
                            cv=5,
                            verbose=1,
                            stratify=False,
                            shuffle=False)

sclf.fit(X, y)

roc_auc_score(y, sclf.train_meta_features_[:, 1])
# 0.9888351567041911
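
One way to read the two AUC values above: if the rows of the meta-features follow the shuffled fold order while y stays in its original order, each score is compared against an unrelated label, so the AUC falls to roughly 0.5. The sketch below reproduces that effect with synthetic scores in plain NumPy. The `auc` helper is a hypothetical rank-based implementation that assumes no tied scores; it is not the scikit-learn function, and the data is simulated rather than taken from the breast cancer example.

```python
import numpy as np


def auc(y_true, scores):
    """Rank-based AUC (Mann-Whitney U statistic); assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# Synthetic scores that are informative about y, plus some noise.
scores = y + rng.normal(scale=0.5, size=len(y))

# Same scores, but with the rows in a shuffled order (as after shuffled CV).
perm = rng.permutation(len(y))

aligned_auc = auc(y, scores)           # scores line up with y
misaligned_auc = auc(y, scores[perm])  # scores no longer line up with y
print(aligned_auc, misaligned_auc)
```

The scores themselves are identical in both calls; only the row order differs, which is enough to destroy the measured AUC.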
Owner

rasbt commented Apr 16, 2018

  1. sclf.train_meta_features_.shape is actually (number of training rows, number of classifiers *2) because both classes predictions (in the case of a binary problem) are maintained.

Thanks for pointing that out. The description in the docs was probably a copy & paste error from when it was ported from the StackingClassifier docs (I have to double-check). I think it is correct for self.use_probas=False, but it should be corrected for the self.use_probas=True scenario.

  2. I have doubts that the index order is maintained when setting stratify =True or shuffle = True

I think you are right. I was just inspecting the code: the meta features get saved first, and the reordering of the labels is only done after that:

        if self.store_train_meta_features:
            self.train_meta_features_ = all_model_predictions

        # We have to shuffle the labels in the same order as we generated
        # predictions during CV (we kinda shuffled them when we did
        # Stratified CV).
        # We also do the same with the features (we will need this only IF
        # use_features_in_secondary is True)
        reordered_labels = np.array([]).astype(y.dtype)
        reordered_features = np.array([]).reshape((0, X.shape[1]))\
            .astype(X.dtype)
        for train_index, test_index in skf:
            reordered_labels = np.concatenate((reordered_labels,
                                               y[test_index]))
            reordered_features = np.concatenate((reordered_features,
                                                 X[test_index]))

Instead of reordering the labels, it might be better to reorder the meta features (aka all_model_predictions). I think that should solve the problem.
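
A minimal sketch of that idea, assuming the out-of-fold predictions are concatenated in the order in which the folds were generated: collect the test indices of every fold and use them to write each prediction back to its original row position. Apart from `all_model_predictions`, the variable names are illustrative and not taken from the mlxtend source, and the folds are built by hand rather than with a scikit-learn splitter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_folds = 10, 5

# Shuffled folds: each array holds the original row indices of one test fold.
folds = np.array_split(rng.permutation(n_samples), n_folds)

# Pretend the prediction for original row i is i * 10, so order is easy to check.
true_predictions = np.arange(n_samples) * 10.0

# Predictions as they come out of the CV loop: concatenated in fold order.
all_model_predictions = np.concatenate([true_predictions[idx] for idx in folds])
test_indices = np.concatenate(folds)

# Scatter every prediction back to the row it was made for.
reordered = np.empty_like(all_model_predictions)
reordered[test_indices] = all_model_predictions

# Equivalent one-liner: sort the predictions by their original row index.
reordered_alt = all_model_predictions[np.argsort(test_indices)]
```

The same indexing works unchanged for a 2D array of meta-features, since it only rearranges rows. Reordering the predictions this way leaves y and X untouched, so anything stored for the user lines up with the data that was passed in.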

Owner

rasbt commented Apr 20, 2018

I just merged a fix; the meta features should now be saved in the order of the original labels! Note that I also renamed the refit param to use_clones based on the other discussion. The changes are currently only in master, but I am planning to make a new release soon.
To install the current master branch version, you can run

pip install git+git://github.com/rasbt/mlxtend.git

Anyways, thanks a lot for pointing these issues out!
