
Not able to fit metaclassifier with StackingCVClassifier #605

Closed
iagayo opened this issue Oct 10, 2019 · 6 comments · Fixed by #606

Comments


iagayo commented Oct 10, 2019

Hi there,
I'm trying to perform text classification with stacking. I'm new to ML, so apologies if this is a silly question.
I'm trying to train the same algorithm, LogisticRegression, on different textual features to create different classifiers, and then use a meta-classifier (also LogisticRegression) to combine them. The features I'm using are the words in the text and the corresponding part-of-speech (POS) tags.

The classifier that uses words as a feature is defined with the following pipeline:

lr = LogisticRegression()

words = make_pipeline(ColumnSelector(column='text'),
                      CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000),
                      lr)

The classifier that uses POS as a feature is defined with the following pipeline:

pos = make_pipeline(ColumnSelector(column='pos'),
                    CountVectorizer(binary=True, ngram_range=(2,3),
                                    max_features=5000),
                    lr)

Finally, the metaclassifier is defined this way:

sclf = StackingCVClassifier(classifiers=[words, pos], 
                            meta_classifier=lr)

The problem comes when I try to train the classifier:

classifiers = {"Words": words,
               "POS": pos,
               "Stack": sclf}

for key in classifiers:
    classifier = classifiers[key]
    classifier.fit(X_train, Y_train)

Words and POS are fitted, but the Stack classifier is not, and I get the following error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

X_train is a DataFrame with a column "text" containing the raw text and a column "pos" containing the raw POS tags; that's why I apply the necessary transformations through the pipelines.
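
For reference, here is a minimal sketch of the data layout I'm describing (the column values below are made up just for illustration):

import pandas as pd

# X_train holds the raw text and the corresponding raw POS tag strings
X_train = pd.DataFrame({
    'text': ['the cat sat on the mat', 'dogs bark loudly'],
    'pos':  ['DET NOUN VERB ADP DET NOUN', 'NOUN VERB ADV'],
})
Y_train = [0, 1]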

When I try the same with StackingClassifier, I don't have this problem.
Any idea what's going wrong?

Thanks!

@rasbt rasbt added the Question label Oct 11, 2019

rasbt commented Oct 11, 2019

I can confirm that there seems to be an issue. For instance, the following self-contained example works fine with the StackingClassifier:

from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np


categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']


twenty_train = fetch_20newsgroups(subset='train',
  categories=categories, shuffle=True, random_state=42)

X_data = twenty_train.data
y = twenty_train.target

lr1 = LogisticRegression()
lr2 = LogisticRegression()
lr3 = LogisticRegression()


words = make_pipeline(
                      CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000),
                      lr1)

pos = make_pipeline(
                    CountVectorizer(binary=True, ngram_range=(2,3), 
                                    max_features=5000),
                    lr2)

sclf = StackingClassifier(classifiers=[words, pos], meta_classifier=lr3)
sclf.fit(X_data, y)

However, the StackingCVClassifier seems to have a bug with that text-data pipeline input:

scvclf = StackingCVClassifier(classifiers=[words, pos], meta_classifier=lr3, cv=5)
scvclf.fit(X_data, y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-2f5cf1116593> in <module>
      1 scvclf = StackingCVClassifier(classifiers=[words, pos], meta_classifier=lr3, cv=5)
----> 2 scvclf.fit(X_data, y)

~/miniconda3/lib/python3.7/site-packages/mlxtend/classifier/stacking_cv_classification.py in fit(self, X, y, groups, sample_weight)
    209 
    210         # Input validation.
--> 211         X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
    212 
    213         if sample_weight is None:

~/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    717                     ensure_min_features=ensure_min_features,
    718                     warn_on_dtype=warn_on_dtype,
--> 719                     estimator=estimator)
    720     if multi_output:
    721         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    519                     "Reshape your data either using array.reshape(-1, 1) if "
    520                     "your data has a single feature or array.reshape(1, -1) "
--> 521                     "if it contains a single sample.".format(array))
    522 
    523         # in the future np.flexible dtypes will be handled like object dtypes
...  "From: dyer@spdcc.com (Steve Dyer)\nSubject: Re: Is MSG sensitivity superstition?\nOrganization: S.P. Dyer Computer Consulting, Cambridge MA\nLines: 14\n\nIn article <1qnns0$4l3@agate.berkeley.edu> spp@zabriskie.berkeley.edu (Steve Pope) writes:\n>The mass of anectdotal evidence, combined with the lack of\n>a properly constructed scientific experiment disproving\n>the hypothesis, makes the MSG reaction hypothesis the\n>most likely explanation for events.\n\nYou forgot the smiley-face.\n\nI can't believe this is what they turn out at Berkeley.  Tell me\nyou're an aberration.\n\n-- \nSteve Dyer\ndyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer\n"].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Looks like it is passing the raw X_data array somewhere it shouldn't. It could just be due to the input-checking function, though ...
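
In fact, the validation step alone seems to reproduce the error; here is a minimal sketch that isolates just the check_X_y call from the traceback above (the documents are made up):

from sklearn.utils.validation import check_X_y

# A list of raw documents becomes a 1D array of strings, so the 2D check
# inside check_X_y raises the same "Reshape your data" ValueError as above.
docs = ["first made-up document", "second made-up document"]
check_X_y(docs, [0, 1], accept_sparse=['csc', 'csr'], dtype=None)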


rasbt commented Oct 11, 2019

For the example I posted above, disabling the input checking solves the problem. The inputs were being validated before they were passed to the pipelines, and since the input data is raw text, that validation fails. So there's currently no good way to check inputs when pipelines are used.
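
The general idea is something like the following sketch (illustrative only, not the actual patch, and the helper name is made up): skip the strict check when any of the base estimators is a Pipeline, and let the pipeline's own transformers deal with the raw input.

from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_X_y

def _maybe_check_X_y(X, y, classifiers):
    # Sketch: raw inputs (text, DataFrames, ...) are only transformed inside
    # the pipelines, so defer validation to them instead of calling check_X_y here.
    if any(isinstance(clf, Pipeline) for clf in classifiers):
        return X, y
    return check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)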

Not sure if it will solve the DataFrame issue, though. I will let you know once I've merged it into master.


iagayo commented Oct 11, 2019

Great, thanks!


rasbt commented Oct 11, 2019

Alright, the changes should be in the master branch now. Can you install the latest dev version via

pip install git+git://github.com/rasbt/mlxtend.git

and give it another try?
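
(To double-check that the dev version is the one being picked up, something like this should work:)

import mlxtend
print(mlxtend.__version__)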


iagayo commented Oct 14, 2019

Done, it works now. Thanks!


rasbt commented Oct 14, 2019

Awesome, thanks for letting me know.
