
Not able to fit metaclassifier with StackingCVClassifier #605

Closed
iagayo opened this issue Oct 10, 2019 · 6 comments · Fixed by #606

Comments


iagayo commented Oct 10, 2019

Hi there,
I'm trying to perform text classification with stacking. I'm new to ML, so apologies if this is a silly question.
I'm trying to train the same algorithm, LogisticRegression, on different textual features to create different classifiers, and then use a meta-classifier (also LogisticRegression) to combine them. The features I'm using are the words in the text and the corresponding part-of-speech (POS) tags.

The classifier that uses words as a feature is defined with the following pipeline:

lr = LogisticRegression()

words = make_pipeline(ColumnSelector(column='text'),
                      CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000),
                      lr)

The classifier that uses POS as a feature is defined with the following pipeline:

pos = make_pipeline(ColumnSelector(column='pos'),
                    CountVectorizer(binary=True, ngram_range=(2,3),
                                    max_features=5000),
                    lr)

Finally, the metaclassifier is defined this way:

sclf = StackingCVClassifier(classifiers=[words, pos], 
                            meta_classifier=lr)

The problem comes when I try to train the classifier:

classifiers = {"Words": words,
               "POS": pos,
               "Stack": sclf}

for key in classifiers:
    classifier = classifiers[key]
    classifier.fit(X_train, Y_train)

Words and POS are fitted, but the Stack classifier is not, and I get the following error:

IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

X_train is a DataFrame with a column "text" containing the raw text and a column "pos" containing the raw POS tags; that's why I apply the necessary transformations through the pipelines.
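
For reference, here is a minimal sketch of the data layout I'm describing (the column values below are made up just for illustration):

import pandas as pd

# X_train holds the raw text and the corresponding raw POS tag strings
X_train = pd.DataFrame({
    'text': ['the cat sat on the mat', 'dogs bark loudly'],
    'pos':  ['DET NOUN VERB ADP DET NOUN', 'NOUN VERB ADV'],
})
Y_train = [0, 1]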

When I try the same with StackingClassifier, I don't have this problem.
Any idea what's going wrong?

Thanks!

@rasbt rasbt added the Question label Oct 11, 2019

rasbt commented Oct 11, 2019

I can confirm that there seems to be an issue. For instance, the following self-contained example works fine with the StackingClassifier:

from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np


categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']


twenty_train = fetch_20newsgroups(subset='train',
  categories=categories, shuffle=True, random_state=42)

X_data = twenty_train.data
y = twenty_train.target

lr1 = LogisticRegression()
lr2 = LogisticRegression()
lr3 = LogisticRegression()


words = make_pipeline(
                      CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000),
                      lr1)

pos = make_pipeline(
                    CountVectorizer(binary=True, ngram_range=(2,3), 
                                    max_features=5000),
                    lr2)

sclf = StackingClassifier(classifiers=[words, pos], meta_classifier=lr3)
sclf.fit(X_data, y)

However, the StackingCVClassifier seems to have a bug with that text-data pipeline input:

scvclf = StackingCVClassifier(classifiers=[words, pos], meta_classifier=lr3, cv=5)
scvclf.fit(X_data, y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-2f5cf1116593> in <module>
      1 scvclf = StackingCVClassifier(classifiers=[words, pos], meta_classifier=lr3, cv=5)
----> 2 scvclf.fit(X_data, y)

~/miniconda3/lib/python3.7/site-packages/mlxtend/classifier/stacking_cv_classification.py in fit(self, X, y, groups, sample_weight)
    209 
    210         # Input validation.
--> 211         X, y = check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)
    212 
    213         if sample_weight is None:

~/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    717                     ensure_min_features=ensure_min_features,
    718                     warn_on_dtype=warn_on_dtype,
--> 719                     estimator=estimator)
    720     if multi_output:
    721         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    519                     "Reshape your data either using array.reshape(-1, 1) if "
    520                     "your data has a single feature or array.reshape(1, -1) "
--> 521                     "if it contains a single sample.".format(array))
    522 
    523         # in the future np.flexible dtypes will be handled like object dtypes
...  "From: dyer@spdcc.com (Steve Dyer)\nSubject: Re: Is MSG sensitivity superstition?\nOrganization: S.P. Dyer Computer Consulting, Cambridge MA\nLines: 14\n\nIn article <1qnns0$4l3@agate.berkeley.edu> spp@zabriskie.berkeley.edu (Steve Pope) writes:\n>The mass of anectdotal evidence, combined with the lack of\n>a properly constructed scientific experiment disproving\n>the hypothesis, makes the MSG reaction hypothesis the\n>most likely explanation for events.\n\nYou forgot the smiley-face.\n\nI can't believe this is what they turn out at Berkeley.  Tell me\nyou're an aberration.\n\n-- \nSteve Dyer\ndyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer\n"].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Looks like it is passing the raw X_data array somewhere it shouldn't. It could just be due to the input-checking function, though ...
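
In fact, the validation step alone seems to reproduce the error; here is a minimal sketch that isolates just the check_X_y call from the traceback above (the documents are made up):

from sklearn.utils.validation import check_X_y

# A list of raw documents becomes a 1D array of strings, so the 2D check
# inside check_X_y raises the same "Reshape your data" ValueError as above.
docs = ["first made-up document", "second made-up document"]
check_X_y(docs, [0, 1], accept_sparse=['csc', 'csr'], dtype=None)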


rasbt commented Oct 11, 2019

For the example I posted above, disabling the input checking solves the problem. The inputs were being validated before they were passed to the pipelines, and since the input data is raw text, that validation fails. So there's currently no good way to check inputs when pipelines are used.
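
The general idea is something like the following sketch (illustrative only, not the actual patch, and the helper name is made up): skip the strict check when any of the base estimators is a Pipeline, and let the pipeline's own transformers deal with the raw input.

from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_X_y

def _maybe_check_X_y(X, y, classifiers):
    # Sketch: raw inputs (text, DataFrames, ...) are only transformed inside
    # the pipelines, so defer validation to them instead of calling check_X_y here.
    if any(isinstance(clf, Pipeline) for clf in classifiers):
        return X, y
    return check_X_y(X, y, accept_sparse=['csc', 'csr'], dtype=None)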

Not sure if it will solve the DataFrame issue, though. I will let you know once I've merged it into master.


iagayo commented Oct 11, 2019

Great, thanks!


rasbt commented Oct 11, 2019

Alright, the changes should be in the master branch now. Can you install the latest dev version via

pip install git+git://github.com/rasbt/mlxtend.git

and give it another try?
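
(To double-check that the dev version is the one being picked up, something like this should work:)

import mlxtend
print(mlxtend.__version__)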


iagayo commented Oct 14, 2019

Done, it works now. Thanks!


rasbt commented Oct 14, 2019

Awesome, thanks for letting me know.
