Extend `FeatureUnion` to better handle heterogeneous data #2034

Closed
jnothman opened this issue Jun 5, 2013 · 51 comments · Fixed by #9012

Comments

@jnothman jnothman commented Jun 5, 2013

FeatureUnion currently passes identical data to each constituent transformer. Often one wants to differentiate between groups of features in how they are transformed. While this is possible by making each stacked transformer a Pipeline consisting of a pre-determined feature selector and another transformer, this is cumbersome.

A parameter should be added to specify which features are routed to which constituents. This is not necessarily trivial to design, particularly because the input X to FeatureUnion.transform need not be a conventional 2d feature array (it may be a list/array of dicts, texts, or other objects).

@kmike kmike commented Sep 20, 2013

What's wrong with pipelines? I don't like that, for each TransformerMixin subclass that just extracts some features, a boilerplate fit method like

    def fit(self, X, y=None):
        return self

has to be implemented, but other than that, using Pipelines looks quite sane and flexible.

@jnothman jnothman commented Sep 29, 2013

Pipelines are fine functionally, but they are cumbersome to specify, and their parameters are cumbersome to address. To process documents that each consist of a dict with fields 'publisher', 'headline' and 'body', you would have me do something like:

class GetItemTransformer(TransformerMixin):
    def __init__(self, field):
        self.field = field
    # assume default fit()
    def transform(self, X):
        return X[self.field]


transformer = FeatureUnion([
    ('body', Pipeline([
        ('get', GetItemTransformer('body')),
        ('transform', TfidfTransformer())
    ])),
    ('headline', Pipeline([
        ('get', GetItemTransformer('headline')),
        ('transform', TfidfTransformer())
    ])),
    ('publisher', Pipeline([
        ('get', GetItemTransformer('publisher')),
        ('transform', OneHotEncoder())
    ])),
])

clf = GridSearchCV(
    Pipeline([
        ('extract', transformer),
        ('clf', LinearSVC())
    ]),
    {
        'clf__C': [1, .1],
        'extract__body__transform__norm': ['l1', 'l2']
    }
)

I think such processing by field is a common enough use-case (see related issues above) to make it a bit simpler. (I think it was @ogrisel who expressed an interest in supporting this use-case more directly.)

I had imagined an extra FeatureUnion param to simplify its construction and param reference under the assumption that we most often want to get a single dict value from the input:

transformer = FeatureUnion([
    ('body', TfidfTransformer()),
    ('headline', TfidfTransformer()),
    ('publisher', OneHotEncoder()),
], transform_fields=['body', 'headline', 'publisher'])

clf = GridSearchCV(
    Pipeline([
        ('extract', transformer),
        ('clf', LinearSVC())
    ]),
    {
        'clf__C': [1, .1],
        'extract__body__norm': ['l1', 'l2']
    }
)

Or perhaps we would just benefit from a construction shorthand that allows:

  • Pipeline or FeatureUnion component names to be unspecified
  • automatic construction of a transformer from a function
  • interpretation of nested lists in FeatureUnion/Pipeline as Pipelines, and nested dicts as FeatureUnions

This is probably too complicated/implicit and leaves parameter names long.

pipe = compose([
    ('extract', {
        'body': [operator.itemgetter('body'), TfidfTransformer()],
        'headline': [operator.itemgetter('headline'), TfidfTransformer()],
        'publisher': [operator.itemgetter('publisher'), OneHotEncoder()],
    }),
    ('clf', LinearSVC())
])

clf = GridSearchCV(
    pipe,
    {
        'clf__C': [1, .1],
        'extract__body__step1__norm': ['l1', 'l2']
    }
)

Even if we keep it simple and take the first approach, it would be nice to provide an example.

@jnothman jnothman changed the title from "ENH extend `FeatureUnion` to better handle heterogeneous data" to "Extend `FeatureUnion` to better handle heterogeneous data" on Aug 16, 2014
@amueller amueller commented Nov 5, 2014

Hmm, do we want to make the current ugly usage an example? It seems pretty common and slightly non-trivial. Or would we rather fix this immediately ;)

@jnothman jnothman commented Nov 5, 2014

Ideally, this should be fixed, but agreeing on an API that is easy to use may be non-trivial too. I think an example is a must-have, hence #3569 being posted long after this issue. An API is a nice-to-have. If they come at the same time, that's great...

@jnothman jnothman commented Nov 5, 2014

In terms of API, what do you think of:

FeatureUnion([('body', TfidfTransformer(), operator.itemgetter('body')), ('headline', TfidfTransformer(), operator.itemgetter('headline'))])

or:

FeatureUnion([('body', TfidfTransformer()), ('headline', TfidfTransformer())], select=[operator.itemgetter('body'), operator.itemgetter('headline')])

(which accords more with the current way to weight transformers within a union)

Can we factor out the operator.itemgetter? Currently this is being applied as X[field] (appropriate for Pandas), but many structures want an attrgetter, and data formatted like the input to DictVectorizer wants [x[field] for x in X].
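
For illustration, these three access patterns could be written as interchangeable selector callables (a sketch only; get_field_from_records is a hypothetical helper name):

import operator

# Field access on a DataFrame / dict-of-columns: X['body']
get_column = operator.itemgetter('body')

# Attribute access on objects or namedtuples: x.body
get_attr = operator.attrgetter('body')

# Per-record access for a list of dicts (DictVectorizer-style input)
def get_field_from_records(X, field='body'):
    return [x[field] for x in X]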

@amueller amueller commented Nov 5, 2014

I think it would be good to get rid of the operator.itemgetter. Why not give it a dict from field to transformer for example?

@amueller amueller commented Nov 5, 2014

So I'd do either

transformer = FeatureUnion([
    ('body', TfidfTransformer()),
    ('headline', TfidfTransformer()),
    ('publisher', OneHotEncoder()),
], transform_fields=['body', 'headline', 'publisher'])

or

transformer = FeatureUnion([
    ('body', TfidfTransformer()),
    ('headline', TfidfTransformer()),
    ('publisher', OneHotEncoder()),
])

That would be a bit tricky when you want the transformer name to be different from the field names, though, and I guess that is kinda common. Maybe by default use the estimator names as field names, and optionally provide the field names. I guess you need to use a list, not a dict, though, so the correspondence is clear.
From a syntax point of view, I think the dict would be prettier ;)

@amueller amueller commented Nov 5, 2014

Using the list syntax would prevent people from passing through the whole dict, though.
Ok crazy idea:

transformer = FeatureUnion({
    'body': TfidfTransformer(),
    'headline': TfidfTransformer(),
    'publisher': OneHotEncoder(),
})

as simple case and

transformer = FeatureUnion({
    'body__firstfeature': TfidfTransformer(),
    'body__secondfeature': TfidfTransformer(analyzer='word'),
    'headline': TfidfTransformer(),
    'publisher': OneHotEncoder(),
})

as a more complex case? That doesn't make the names much prettier, though ;)

@amueller amueller commented Nov 5, 2014

Or just use a list and add an option "pass_through_dict=False".

@jnothman jnothman commented Nov 5, 2014

But what are the semantics? It's straightforward to get by field name if I pass in a struct array or Pandas, but not if it's a list of dicts or a list of namedtuples. If it's a numeric array, then the field specification needs to be e.g. (Ellipsis, list_of_indices). Or we could allow a callable as an override, of course.
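
A sketch of what those per-container semantics could look like, with a callable override (get_field is a hypothetical helper, not an existing API):

import numpy as np

def get_field(X, spec):
    """Hypothetical: interpret `spec` according to the container type of X."""
    if callable(spec):
        return spec(X)                  # explicit override
    if isinstance(X, np.ndarray) and X.dtype.names is None:
        return X[(Ellipsis, spec)]      # plain numeric array: spec is a list of column indices
    return X[spec]                      # DataFrame, struct array or dict: spec is a field name

# get_field(np.zeros((5, 4)), [0, 2])            -> columns 0 and 2
# get_field({'body': ['a', 'b']}, 'body')        -> the 'body' field
# get_field(X, lambda X: [x['body'] for x in X]) -> caller-defined selection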


@amueller amueller commented Nov 5, 2014

I would say that if it is not a dict, struct array or Pandas DataFrame, we pass everything through, and providing transform_fields raises an error.
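
A minimal sketch of that rule (a hypothetical helper; assumes transform_fields=None means "pass everything through"):

def _validate_transform_fields(X, transform_fields=None):
    """Hypothetical check for the behaviour described above."""
    has_fields = (
        hasattr(X, 'loc')                                                   # pandas DataFrame
        or isinstance(X, dict)
        or getattr(getattr(X, 'dtype', None), 'names', None) is not None    # struct array
    )
    if transform_fields is not None and not has_fields:
        raise ValueError("transform_fields was given, but X has no named fields")
    return has_fields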

@amueller amueller commented Nov 5, 2014

You could do arbitrary collections of columns but I don't think I'd want to do that. The user should do that before passing the data in if they really want to.

@jnothman jnothman commented Nov 6, 2014

Well, I presume if the data is not of the right kind (and the dict case includes Bunches), it will raise either a TypeError or a KeyError upon fit without any explicit handling. The other option is also to have a param: get_field=lambda X, f: X[f] (or get_field=operator.itemgetter if you like curry).

Double-underscores will break set_params.

I'm happy with this broad solution, and I think it satisfies the most common cases. If it also means that the proposed example can get merged, that's great.

My preference is for an additional parameter, e.g. select, that like transformer_weights lists the selected fields in parallel with the transformer list.
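
For concreteness, the preferred option could look roughly like this (select and get_field are hypothetical parameters sketched from the discussion, not an existing API):

transformer = FeatureUnion(
    [('body', TfidfTransformer()),
     ('headline', TfidfTransformer()),
     ('publisher', OneHotEncoder())],
    # hypothetical: fields listed in parallel with the transformer list,
    # analogous to transformer_weights
    select=['body', 'headline', 'publisher'],
    # hypothetical: how a field is pulled out of X; defaults to X[field]
    get_field=lambda X, f: X[f],
)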


@amueller amueller commented Nov 6, 2014

I think we can merge the example first and later simplify it using the new API.

@mrterry mrterry commented Nov 9, 2014

@jnothman just merged #3680 which involved some mildly contentious boilerplate related to this ticket. Were this ticket to be resolved, that boilerplate would be irrelevant. It looks like you have the shape of a solution worked out here. I'd be happy to implement it unless it is already spoken for.

@jnothman jnothman commented Nov 9, 2014

Go for it!

@amueller amueller referenced this issue Nov 25, 2014
@amueller amueller commented Nov 25, 2014

See #3886.

@poonamrbhide poonamrbhide commented Apr 18, 2015

I am trying to write code for my dataset with the help of the example given in this thread. I am not sure if this is the correct forum to ask this question, but I think it is relevant to the title of the discussion: I want to use a Naive Bayes classifier on heterogeneous values.
My features are:

  1. AlignmentCount, which is simply an integer (simply return it as is / DictVectorizer / OneHotEncoder()?)
  2. Paragraph, which is text (CountVectorizer or TFIDF)
  3. A list of some words (?)

I wrote code for combining text + simple numerical features, but it gives me an error:
ValueError: blocks[0,:] has incompatible row dimensions
I am not sure how to use FeatureUnion for combining a text vectorizer with a simple numerical array. Pasting my code here.

class DictSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        print "in dict constructor"
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):                 
        return data_dict[self.key]

class BillFeatureExtractor(BaseEstimator, TransformerMixin):

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        print "in bill extractor tranform"
        sectionidlist = []
        sectiontextlist = []
        print len(posts)
        alignment_clustersizelist=[]
        for text in posts:
            sectionid, alignment_clustersize, sectiontext = text.split(':\n\n\n')
            sectionid=sectionid.strip()
          #  print alignment_clustersize
            alignment_clustersize=int(alignment_clustersize)
            sectiontext=sectiontext.strip()
            sectionidlist.append(sectionid)
            sectiontextlist.append(sectiontext)
            alignment_clustersizelist.append(alignment_clustersize)
        return {'sectionid': sectionidlist, 'sectiontext': sectiontextlist ,'alignmentclustersize': alignment_clustersizelist}

pipeline = Pipeline([
    ('billfeatures', BillFeatureExtractor()),

    ('union', FeatureUnion([

        # Pipeline for pulling features from the post's subject line
        ('sectiontext', Pipeline([
            ('selector', DictSelector(key='sectiontext')),
            ('tfidf', TfidfVectorizer(decode_error='ignore')),
        ])),

        ('alignmentclustersize', Pipeline([
            ('selector', DictSelector(key='alignmentclustersize')),
            ('vectorizer', OneHotEncoder()),
            ])),
    ])),

    ('naive', MultinomialNB()),
])

categories = [
    'c1',
    'c2'
] 
train = load_files('f1/',
    categories=categories)
test = load_files('f2/',
    categories=categories)

pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
print classification_report(y, test.target)
@dmpe dmpe commented Jan 10, 2016

Hello @amueller,
is there any chance of moving forward with this feature somehow? Your PR still awaits a review from somebody...

@tandon-aman tandon-aman commented Jan 15, 2016

I am facing the same problem as described by @poonamrbhide. @dmpe, please help.

@dmpe dmpe commented Jan 15, 2016

@tandon-aman I am the wrong person to ask. File a separate bug if needed.

@amueller amueller commented Jan 15, 2016

Sorry, I don't have time to work on the PR at the moment, and there are still some issues about how exactly to implement this feature.

@jnothman jnothman commented Sep 20, 2016

My point was probably irrelevant musing :) Scikit-learn was built around the paradigm of, for the most part, handling data that was already a numeric array. The feature and target arrays are naturally separate under this paradigm, and the features are naturally somewhat homogeneous. In Pandas you might have the target for supervision in the same dataframe, multiple abstractions (or extents of preparation) of each record, etc. I'm not claiming much, and I'm doing so through sleepy eyes! Thanks for your use-cases.

On 20 September 2016 at 23:19, Mats Julian Olsen notifications@github.com wrote:

I agree that it might be out of place in the Pipeline class, although the API has served us quite well. I of course primarily want the functionality in scikit-learn, not push through any preconceived opinion on what the implementation should look like - so I think your approach is right.

Could you elaborate the last point about extraneous columns?


@mewwts mewwts commented Sep 21, 2016

Thanks for elaborating @jnothman.

@amueller amueller commented Mar 31, 2017

So one particular use case I keep running into is categorical vs. continuous features. You usually don't want to scale or Box-Cox transform one-hot encoded features, and that's currently really painful to do.

Given that this is something that happens basically every time you want to use scikit-learn (not having categorical variables is a relatively rare special case), I think we should try for this again, also in light of #6967.

I think a common case is actually having multiple columns, like this:

select_categorical = FunctionTransformer(lambda X: X[:, categorical_features])
select_continuous = FunctionTransformer(lambda X: X[:, continuous_features])
fu = make_union(select_categorical, make_pipeline(select_continuous, StandardScaler()))

Using .iloc might help make code that makes sense both for pandas and numpy input.
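
A minimal sketch of such a positional selector that works for both pandas and numpy input (ColumnSelector is a hypothetical name, and the column positions are assumptions for illustration):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select columns by integer position from a 2-D array or a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if hasattr(X, 'iloc'):               # pandas DataFrame
            return X.iloc[:, self.columns]
        return X[:, self.columns]            # numpy array

# scale only the continuous columns, pass the categorical ones through
categorical_features = [0, 1]                # assumed positions
continuous_features = [2, 3]
fu = make_union(ColumnSelector(categorical_features),
                make_pipeline(ColumnSelector(continuous_features), StandardScaler()))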

@jnothman jnothman commented Apr 1, 2017

Is it too magical to allow:

make_union(categorical_features, make_pipeline(continuous_features, StandardScaler()))

where elements that are lists, arrays or slices -- instead of estimators -- are handled as getters on the second axis? I'm still not sure how to mediate between iloc and loc semantics.
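
One way to read that proposal, sketched under the assumption of positional (iloc-style) array input; magic_make_union and _as_transformer are hypothetical helpers, not an existing API:

import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

def _as_transformer(spec):
    """Hypothetical: wrap lists, arrays or slices as column getters on the 2nd axis."""
    if isinstance(spec, (list, np.ndarray, slice)):
        return FunctionTransformer(lambda X, cols=spec: X[:, cols], validate=False)
    return spec                              # already an estimator / pipeline

def magic_make_union(*steps):
    """Hypothetical variant of make_union that also accepts raw column specs."""
    return make_union(*[_as_transformer(step) for step in steps])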

@jnothman jnothman commented Apr 2, 2017

One advantage of a more specialised implementation may be the ability to inverse_transform an appropriately constructed FeatureUnion...? Or is that too messy in any case?

@ncullen93 ncullen93 commented Jul 17, 2017

What about having a ParallelFeatureUnion class where the user supplies a list of inputs and a list of transforms, both of equal length, in a one-to-one mapping?

E.g.

inputs = [input_a, input_b, input_c]
union = ParallelFeatureUnion([transform_a, transform_b, transform_c])
union.transform(inputs)

This could work with any arbitrary iterable.

inputs = {'a': input_a, 'b': input_b, 'c': input_c}
union = ParallelFeatureUnion([transform_a, transform_b, transform_c])
union.transform(inputs)

or

inputs = {'a': input_a, 'b': input_b, 'c': input_c}
union = ParallelFeatureUnion([('a', transform_a), ('b', transform_b), ('c', transform_c)])
union.transform(inputs)

I realize this doesn't address heterogeneous data in pandas dataframes, but it would be nice for data with images + text.
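
A minimal sketch of the list-input case (ParallelFeatureUnion does not exist in scikit-learn; this assumes equal-length lists and dense transformer outputs):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ParallelFeatureUnion(BaseEstimator, TransformerMixin):
    """Hypothetical: apply transformers[i] to inputs[i] and hstack the results."""
    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, Xs, y=None):
        for transformer, X in zip(self.transformers, Xs):
            transformer.fit(X, y)
        return self

    def transform(self, Xs):
        return np.hstack([transformer.transform(X)
                          for transformer, X in zip(self.transformers, Xs)])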

@ncullen93 ncullen93 commented Jul 17, 2017

Also, I feel it would be nice to have actual support for heterogeneous data - which can only come in the form of an iterable (which seems to break a lot of things). Is there any desire to add more support for multiple inputs in a list, for instance?

For example, imagine I have 3 images and 2 continuous features as input (e.g. three brain images and 2 demographic variables) and want to predict 2 continuous features. This tends to break a lot of things (anything related to cross-validation objects).

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# assume three flattened brain image voxels and 3 continuous features
X = [np.random.randn(20,100), np.random.randn(20,110), np.random.randn(20,110)]
X += [np.random.randn(20,3)]
Y = np.random.randn(20, 2)

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, idx):
        self.idx = idx
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x[self.idx]

def get_image_pipeline(idx, n_components=20):
    return  Pipeline([
                ('select', ItemSelector(idx)),
                ('scale', MinMaxScaler((0,1))),
                ('pca', PCA(n_components=n_components))
                ])

def get_demo_pipeline(idx):
    return Pipeline([
            ('select', ItemSelector(idx)),
            ('scale', StandardScaler())
            ])

process_pipe = Pipeline([
                ('features', FeatureUnion([
                                    ('img1', get_image_pipeline(0)),
                                    ('img2', get_image_pipeline(1)),
                                    ('img3', get_image_pipeline(2)),
                                    ('demo', get_demo_pipeline(3))
                                    ])
                ),
                ('model', LinearRegression())])

# breaks
scores = cross_val_score(process_pipe, X, Y)

Instead, I could do this:

def get_image_pipeline(n_components=20):
    return  Pipeline([
                ('scale', MinMaxScaler((0,1))),
                ('pca', PCA(n_components=n_components))
                ])

# (get_demo_pipeline would similarly drop its ItemSelector)
process_pipe = Pipeline([
                ('features', ParallelFeatureUnion([
                                get_image_pipeline(),
                                get_image_pipeline(),
                                get_image_pipeline(),
                                get_demo_pipeline()
                                ])),
                ('model', LinearRegression())])
@jnothman jnothman commented Jul 17, 2017

@ncullen93 ncullen93 commented Jul 18, 2017

It's a need for iterables of data sources, each with different dimensions, essentially. If my inputs are 3 images with 100, 110, 120 features (pixels) respectively, and 2 clinical variables, it all works with a FeatureUnion pipeline when I use the standard fit and score functions. However, it breaks with cross_val_score and GridSearchCV -- unnecessarily, in my opinion, since the pipeline works but the cross-validation functions error out in a few little helper functions related to calculating the number of samples, which can't handle lists but should (specifically utils.validation._num_samples and utils.validation.check_consistent_length).

It's like this:

pipeline = feature_union_that_takes_multiple_inputs

# works
pipeline.fit([array_1, array_2, array_3], y_array)
pipeline.predict([array_1, array_2, array_3])
pipeline.score([array_1, array_2, array_3], y_array)

# doesn't work - only because helper functions error out
cross_val_score(pipeline, [array_1, array_2, array_3], y_array)

Doesn't it seem like cross_val_score should work if fit, predict, and score work?

@amueller amueller commented Jul 18, 2017

_num_samples should work on lists. Can you give a minimum reproducing example? The length of the list should be the number of samples, which I guess it is not for you. cross_val_score needs to know which axis to slice along, and for lists there's only one possibility without introspection. You can make your code work if you make the list into an array and transpose.
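
One way to follow that advice -- a sketch only, assuming every block has the same number of samples -- is to pack the blocks into a samples-first object array, so that cross-validation can slice along axis 0 and each FeatureUnion branch unpacks its own block:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# hypothetical blocks: three image blocks and one demographic block, 20 samples each
blocks = [np.random.randn(20, 100), np.random.randn(20, 110),
          np.random.randn(20, 120), np.random.randn(20, 3)]

n_samples = blocks[0].shape[0]
X = np.empty((n_samples, len(blocks)), dtype=object)    # shape (n_samples, n_blocks)
for j, block in enumerate(blocks):
    for i in range(n_samples):
        X[i, j] = block[i]                               # one row vector per cell

class BlockSelector(BaseEstimator, TransformerMixin):
    """Recover block `idx` as a 2-D (n_samples, n_features) array."""
    def __init__(self, idx):
        self.idx = idx
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.vstack(X[:, self.idx])

# Now the first axis of X is the sample axis, so cross_val_score can slice it,
# and each branch of a FeatureUnion would start with BlockSelector(idx).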

@ncullen93 ncullen93 commented Jul 18, 2017

Right, I get what you're saying - but it's more natural to have a list of the different data sources together instead of split by samples - e.g. [(n_samples, n_features1), (n_samples, n_features2), ...] - and it's not possible to put it into an array (I think?) when the number of features of each modality is different, as below:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, idx):
        self.idx = idx
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x[self.idx]

class Flatten(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x.reshape(x.shape[0], -1)

def get_image_pipeline(idx, n_components=20):
    return  Pipeline([
                ('select', ItemSelector(idx)),
                ('flat', Flatten()),
                ('scale', MinMaxScaler((0,1))),
                ('pca', PCA(n_components=n_components))
                ])

def get_demo_pipeline(idx):
    return Pipeline([
            ('select', ItemSelector(idx)),
            ('scale', StandardScaler())
            ])

pipe = Pipeline([
                ('features', FeatureUnion([
                                    ('img1', get_image_pipeline(0)),
                                    ('img2', get_image_pipeline(1)),
                                    ('img3', get_image_pipeline(2)),
                                    ('demo', get_demo_pipeline(3))
                                    ])
                ),
                ('model', LinearRegression())])

# assume you have 3 brain images of different size and 3 demographic variables
X = [np.random.randn(20,56,56,56), np.random.randn(20,50,50,50), np.random.randn(20,30,30,30), np.random.randn(20,3)]
# assume trying to predict 2 clinical variables
Y = np.random.randn(20, 2)

# works
pipe.fit(X, Y)
ypred = pipe.predict(X)
score = pipe.score(X, Y)

# doesn't work
scores = cross_val_score(pipe, X, Y)
@jnothman jnothman commented Jul 18, 2017

@amueller amueller commented Jul 25, 2017

It took me too long to understand you meant "code snippet" lol.
@ncullen93 for cross_val_score to work you need to adhere to the API. And what you want to do is possible within the API if you know the number of features in each group.

Trying to slice nested data structures like the one you want to use would probably soon lead to issues. How should it determine which axis is the sample axis in an n-dimensional array? The "samples axis is always first" convention in sklearn gets rid of that problem.

@jnothman jnothman commented Jul 26, 2017

@FrugoFruit90 FrugoFruit90 commented Oct 12, 2017

@jnothman I am trying to fix my use case using your example at the top of the thread. I modified your GetItemTransformer class as follows:

class GetItemTransformer(TransformerMixin):
    def __init__(self, field):
        self.field = field

    # assume default fit()
    def transform(self, X):
        return X[self.field]

This is the pipeline that I currently have:

Pipeline(steps=[('remove_target_columns', ColumnRemover),
                ("['xgb', 'nn', 'svm']", FeatureUnion(n_jobs=1,
                                                      transformer_list=[('xgb', ModelTransformer(xgb_pipeline)),
                                                                        ('nn', ModelTransformer(nn_pipeline)),
                                                                        ('svm', ModelTransformer(svm_pipeline))],
                                                      transformer_weights=None)),
                ('voting', EnsembleBinaryClassifier(mode='average', weights=None))
                ])

Each one of 'xgb', 'nn' and 'svm' is a pipeline (additionally wrapped with ModelTransformer, as otherwise I get the following error: "All estimators should implement fit and transform. (...) (type <class 'sklearn.pipeline.Pipeline'>) doesn't"). Here is the ModelTransformer class:

class ModelTransformer(TransformerMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.model.predict_proba(X))  # change predict_proba to predict if you want predictions

However, I am still unable to pass parameters separately to the aforementioned pipelines. In particular, I am unable to pass callbacks to my keras.wrappers.scikit_learn.KerasClassifier:

fit_params = {
    "['xgb', 'nn', 'svm']__nn__transform__callbacks": [earlyStopping, modelCheck]
              }

The 'nn' pipeline has the following structure:

Pipeline(FeatureUnion([('get', GetItemTransformer('nn')), ('transform', KerasClassifier())]))

What do I need to do to make this work?
Currently, I'm getting the following error:

  File "C:/Users/jan.zysko/Documents/powroty_pracuj\powroty_models.py", line 22, in train_model
    return pipeline.fit(data, data[target_col], **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 734, in fit_transform
    for name, trans, weight in self._iter())
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 577, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\base.py", line 497, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:/Users/jan.zysko/Documents/powroty_pracuj\powroty_models.py", line 81, in fit
    self.model.fit(*args, **kwargs)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 228, in _fit
    fit_params_steps[step][param] = pval
KeyError: 'nn'

Debugging suggests that what happens is that the parameters meant for the neural network still get passed to xgboost.
Any help? It seems that the idea of an example died, although @amueller suggested that an example might be useful, rightly so in my humble opinion.

@jnothman jnothman commented Oct 15, 2017

@FrugoFruit90 FrugoFruit90 commented Oct 16, 2017

Thanks for the example @jnothman !

Does it really allow specifying the underlying parameters, though? (I understood that this was the initial motivation, at least in September 2013 😉)

I tried to check that, and changed the TextStats estimator to simulate having to specify a parameter at the moment of fitting (as is the case with, e.g., Keras callbacks):

class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None, fit_parameter="nothing_here"):
        print(fit_parameter)
        return self

I then called the fit method on pipeline to pass the intended parameter:
pipeline.fit(train.data, train.target, union__body_stats__stats__burn='fit_parameter')

This was the result:

Traceback (most recent call last):
  File "C:/Users/jan.zysko/Downloads/hetero_feature_union.py", line 179, in <module>
    pipeline.fit(train.data, train.target, union__body_stats__stats__burn='fit_parameter')
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 734, in fit_transform
    for name, trans, weight in self._iter())
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 577, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 301, in fit_transform
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\jan.zysko\AppData\Local\Continuum\Anaconda3\envs\MLpred\lib\site-packages\sklearn\pipeline.py", line 228, in _fit
    fit_params_steps[step][param] = pval
KeyError: 'body_stats'

It seems that the substeps from FeatureUnion are not recognized as "proper" steps, so the information about my fit parameter is not sent "down". Am I doing something wrong?


Sidenote about my "All estimators should implement fit and transform. (...) (type <class 'sklearn.pipeline.Pipeline'>) doesn't" error:

Upon closer inspection I found out the reason - some of my classifiers are not sklearn objects, but simply sklearn-compatible (supposedly 😉) wrappers (e.g. XGBoost), and they don't implement a .transform() method, which causes problems similar to those in #8414.

So the error talks about Pipeline, but the problem happens underneath. What I did to solve it was wrap all the "foreign" estimators in the ModelTransformer class. This fixes this particular error, and the pipeline works correctly without fit_parameters. I really need to pass them, though; using Keras simply sucks without them 😢

@jnothman jnothman commented Oct 16, 2017

@FrugoFruit90 FrugoFruit90 commented Oct 17, 2017

Well, in the XGBoost API, for example, there are some arguments that are sent to fit rather than being set as class parameters, such as eval_set, eval_metric, etc.:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

@jnothman jnothman commented Oct 17, 2017

@FrugoFruit90 FrugoFruit90 commented Oct 17, 2017

OK, so the kind of thing I am trying to do is, and will remain, impossible, right?
Thank you for your time anyway, always very educational.

@jnothman jnothman commented Oct 17, 2017

@FrugoFruit90 FrugoFruit90 commented Oct 17, 2017

Well, if you or anyone else could think of a workaround for now, even the most hacky one, I would love to see it.
