Pandas in, Pandas out? #5523

Closed
naught101 opened this issue Oct 22, 2015 · 64 comments

@naught101

At the moment, it's possible to use a pandas dataframe as an input for most sklearn fit/predict/transform methods, but you get a numpy array out. It would be really nice to be able to get data out in the same format you put it in.

This isn't perfectly straightforward, because if your DataFrame contains columns that aren't numeric, then the intermediate numpy arrays will cause sklearn to fail, because they will be dtype=object instead of dtype=float. This can be solved by having a DataFrame->ndarray transformer that maps the non-numeric data to numeric data (e.g. integers representing classes/categories). sklearn-pandas already does this, although it currently doesn't have an inverse_transform, but that shouldn't be hard to add.

I feel like a transform like this would be really useful to have in sklearn - it's the kind of thing that anyone working with datasets with multiple data types would find useful. What would it take to get something like this into sklearn?
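
To make the current behaviour concrete, a minimal sketch (standard scikit-learn behaviour as of this issue): a DataFrame goes in, a plain ndarray comes out, and the index and column names are lost.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=["r1", "r2", "r3"])
Xt = StandardScaler().fit_transform(df)
print(type(Xt))  # <class 'numpy.ndarray'> -- the index and column names are gone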

@jnothman
Member

Scikit-learn was designed to work with a very generic input format. Perhaps the world around scikit-learn has changed a lot since then, in ways that make Pandas integration more important. It could still largely be supplied by third-party wrappers.

But apart from the broader question, I think you should try to give examples of how Pandas-friendly output from standard estimators will differ and make a difference to usability. Examples I can think of:

  • all methods could copy the index from the input
  • transformers should output appropriately-named columns
  • multiclass predict_proba can label columns with class names
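
To illustrate these points (especially the last one), a minimal hand-rolled sketch of what such output could look like; the re-wrapping here is done by hand and is not scikit-learn API:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]},
                 index=["r1", "r2", "r3", "r4"])
y = ["cat", "dog", "cat", "dog"]

clf = LogisticRegression().fit(X, y)
# Copy the input index and label the columns with the class names.
proba = pd.DataFrame(clf.predict_proba(X), index=X.index, columns=clf.classes_)
print(proba)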

@naught101
Author

Yep, off the top of my head:

  • the index can be really useful, e.g. for creating time-lagged variables (e.g. lag 1 day, on daily data with some missing days)
  • sklearn regressors could be used transparently with categorical data (pass a mixed dataframe, transform categorical columns with LabelBinarizer, then inverse_transform it back).
  • sklearn-pandas already provides a nice interface that allows you to pass a dataframe, only use a subset of the data, and arbitrarily transform individual columns.

If this is all in a transform, then it doesn't really affect how sklearn works by default.
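
For context, a minimal sketch of the sklearn-pandas interface mentioned above (column names and data are illustrative, and the exact DataFrameMapper signature may vary between sklearn-pandas versions):

import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({"pet": ["cat", "dog", "dog"], "children": [4.0, 6.0, 3.0]})

# Map each column (or list of columns) to its own transformer.
mapper = DataFrameMapper([
    ("pet", LabelBinarizer()),
    (["children"], StandardScaler()),
])
print(mapper.fit_transform(df))  # a plain numpy array by default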

@jnothman
Member

I don't think it can be implemented nicely as a transformer. It would be one or more metaestimators or mixins. I think they should be initially implemented externally and demonstrated as useful.


@amueller
Member

Making "pandas in" better was kind of the idea behind the column transformer PR #3886. Maybe I should have looked more closely into what sklearn-pandas is already doing. I'm not entirely sure what the best way forward is there.

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I can't find the issue where we discussed this right now. Maybe @jnothman remembers. I would really like that, though it would require major surgery with the input validation to preserve the column names :-/

Related #4196

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 23, 2015 via email

@amueller
Member

True, but I think that would be nice ;)

One question is maybe whether we want this only in pipelines or everywhere. If we restrict it to pipelines, the input validation surgery would be smaller. But I'm not sure how useful it would be.

@kastnerkyle
Member

You can always do a pipeline with just one thing in it, right? So we kind of handle all cases (though it is hacky in the limit of 1 object) by restricting to just pipeline at first...

@sinhrks
Contributor

sinhrks commented Oct 23, 2015

+1. Starting with pipeline sounds nice, then covering all transformers in a next step.

I also have an implementation integrating pandas and sklearn, which can restore column info via inverse_transform (dirty hack though...)

http://pandas-ml.readthedocs.org/en/latest/sklearn.html

@GaelVaroquaux
Member

• the index can be really useful, e.g. for creating time-lagged variables (e.g. lag 1 day, on daily data with some missing days)

I am a bit stupid, but aren't we talking about something in the sample direction here, rather than the feature direction?

• sklearn regressors could be used transparently with categorical data (pass a mixed dataframe, transform categorical columns with LabelBinarizer, then inverse_transform it back).

• sklearn-pandas already provides a nice interface that allows you to pass a dataframe, only use a subset of the data, and arbitrarily transform individual columns.

OK, but that's all at the level of one transformer that takes Pandas in and gives a data matrix out, isn't it? Rather than attempting a modification of all the objects of scikit-learn (which is a risky endeavour), we could first implement this transformer (I believe that @amueller has this on his mind).

@naught101
Author

sample direction here, rather than the feature direction?

Yep.

OK, but that's all at the level of one transformer that takes Pandas in, and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with a wrapper that dealt with X and y as dataframes. I don't see an obvious reason to screw with sklearn's internals.

@GaelVaroquaux
Member

OK, but that's all at the level of one transformer that takes Pandas in, and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with a wrapper that dealt with X and y as dataframes. I don't see an obvious reason to screw with sklearn's internals.

Then we are on the same page. I do think that @amueller has ideas about this, and we might see some discussion, and maybe code, soon.

@jnothman
Member

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I don't find the issue where we discussed this right now.

#5172

@jnothman
Member

jnothman commented Nov 2, 2015

A note: I had wondered if one would only want to wrap the outermost estimator in an ensemble to provide this functionality to a user. I think the answer is: no, one wants to wrap atomic transformers too, to allow for dataframe-aware transformers within a pipeline (why not?). Without implementing this as a mixin, I think you're going to get issues with unnecessary parameter prefixing or else problems cloning (as in #5080).

@languitar

👍

@dwyatte

dwyatte commented Jan 14, 2016

Just wanted to toss out the solution I am using:

import pandas as pd


def check_output(X, ensure_index=None, ensure_columns=None):
    """
    Joins X with ensure_index's index or ensure_columns's columns when available
    """
    if ensure_index is not None:
        if ensure_columns is not None:
            if type(ensure_index) is pd.DataFrame and type(ensure_columns) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index, columns=ensure_columns.columns)
        else:
            if type(ensure_index) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index)
    return X

I then create wrappers around sklearn's estimators that call this function on the output of transform, e.g.:

from sklearn.preprocessing import StandardScaler as _StandardScaler 
class StandardScaler(_StandardScaler):
    def transform(self, X):
        Xt = super(StandardScaler, self).transform(X)
        return check_output(Xt, ensure_index=X, ensure_columns=X)

Classifiers that need the index of the input dataframe X can just use it directly (useful for time series, as was pointed out).

This approach has the benefit of being completely compatible with the existing sklearn design while also preserving the speed of computation (math operations and indexing on dataframes are up to 10x slower than numpy arrays, http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Unfortunately, it's a lot of tedious work to add to each estimator that could utilize it.
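
A usage sketch for the wrapper above (assuming the check_output and StandardScaler definitions from this comment are in scope):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}, index=["r1", "r2", "r3"])

scaled = StandardScaler().fit(df).transform(df)
print(scaled.index.tolist())    # ['r1', 'r2', 'r3'] -- index preserved
print(scaled.columns.tolist())  # ['a', 'b'] -- column names preserved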

@jnothman
Member

Maybe it's only necessary to make a Pipeline variant with this magic...
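
A minimal sketch of that idea (the class name is hypothetical, not scikit-learn API): a Pipeline subclass that re-wraps its ndarray output as a DataFrame reusing the input's index.

import pandas as pd
from sklearn.pipeline import Pipeline

class DataFramePipeline(Pipeline):
    """A Pipeline that returns a DataFrame when it was given one (sketch only)."""

    def transform(self, X):
        Xt = super().transform(X)
        if isinstance(X, pd.DataFrame) and Xt.shape[0] == X.shape[0]:
            # Column names are generally unknown after arbitrary steps,
            # so fall back to positional column labels here.
            return pd.DataFrame(Xt, index=X.index)
        return Xt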


@naught101
Author

Or just something that wraps a pipeline/estimator, no?

I don't really understand why you'd call a function like that "check_*" when it's doing far more than just checking though...


@dwyatte

dwyatte commented Jan 15, 2016

I'm not sure if Pipeline is the right place to start, because all column name inheritance is estimator-specific: e.g. scalers should inherit the column names of the input dataframe, whereas models like PCA should not. Feature selection estimators should inherit specific column names, but that is another problem, probably more related to #2007.

Is it always the case that the number of rows is preserved during transform? If so, just inheriting the index of the input (if it exists) seems safe. I'm not sure that getting a dataframe with default column names (e.g., [0, 1, 2, 3, ...]) is better than the current behavior from an end-user perspective, but if an explicit wrapper/meta-estimator is used, then at least the user will know what to expect.

Also, agreed that check_* is a poor name -- I was doing quite a bit more validation in my function, and just stripped out the dataframe logic to post here.

@amueller
Member

I think pipeline would be the place to start, though we would need to add something to all estimators that maps the column names appropriately.

@makmanalp

transformers should output appropriately-named columns @naught101

though it would require major surgery with the input validation to preserve the column names :-/ @amueller

Not only input validation: every transform would have to describe what it does to the input columns. @GaelVaroquaux

Has anyone thought about the mechanics of how to pass the names around, from transformer to transformer, and perhaps how to track the provenance? Where would one store this?

A friend of mine, @cbrummitt, has a similar problem, where each column of his design matrix is a functional form (e.g. x^2, x^3, x_1^3 x_2^2, represented as sympy expressions), and he has transformers that act similarly to PolynomialFeatures, which can take in functional forms and generate new ones based on them. But he's using sympy to take the old expressions and generate new ones; storing the expressions as string labels doesn't cut it, and it gets complicated when you layer the function transformations. He could do all this outside the pipeline, but then he doesn't get the benefit of GridSearch, etc.

I guess the more general version of our question is: how do you pass some information from transformer to transformer that is NOT the data itself? I can't come up with a great way without having pipeline-global state, or having each transformer / estimator know about the previous ones, or having each step return multiple things, or something.

We then also came up with the idea of modifying pipeline to keep track of this; you'd have to change _fit() and _transform() and perhaps a few other things. That seems like our best option.

This sounds crazy, but what it feels like is that we really want our data matrix to be sympy expressions, with each transformation generating new expressions. This is gross; check_array() stops it from happening, and it would make other steps down the pipeline angry.

@amueller
Member

amueller commented Oct 7, 2016

see #6425 for the current idea.

@jnothman
Member

jnothman commented Oct 8, 2016

All you want is a mapping, for each transformer (including a pipeline of transformers), from input feature names to output feature names (or some structured representation of the transformations, which I suspect is more engineering than we're going to get). That's what #6425 provides.
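
A mapping like that is what later shipped in scikit-learn itself as get_feature_names_out (available from scikit-learn 1.0 onwards); a minimal sketch:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
poly = PolynomialFeatures(degree=2).fit(X)
print(poly.get_feature_names_out())
# ['1' 'a' 'b' 'a^2' 'a b' 'b^2'] -- output names derived from the input column names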


@makmanalp

We'll look into this, thank you!

@jimmywan
Contributor

Can someone provide a general update on the state of the world wrt this issue?

Will pandas DataFrame support always be a YMMV thing?
Guidance on what is/isn't considered safe for use with a pandas DataFrame instead of just an ndarray would be helpful. Perhaps something along the lines of the following (MADE UP EXAMPLE TABLE):

module/category              can safely consume pandas DataFrame?
sklearn.pipeline             SAFE
sklearn.feature_selection    SAFE
regressors                   YMMV
sklearn.feature_extraction   NOT SAFE, no plan to implement
etc.                         ...

Right now, I'm not sure of an approach other than "just try it and see if it throws exceptions".

We've tested a handful of hand-coded examples that seem to work just fine accepting a pandas DataFrame, but we can't help thinking this will inevitably stop working right when we decide we need to make a seemingly trivial pipeline component swap... at which point everything falls down like a house of cards with a cryptic stack trace.

My initial thought was to create a replacement pipeline object that can consume a pandas DataFrame and auto-generates wrappers for standard scikit-learn components to convert input/output DataFrame objects into numpy ndarray objects as necessary. That way I can write my own custom Selectors/Transformers that make use of pandas DataFrame primitives, but that seems a bit heavy-handed. Especially so, if we're on the cusp of having "official" support for them.

I've been following a few different PRs, but it's hard to get a sense for which ones are abandoned and/or which reflect the current thinking:
Example:
#6425 (referenced Oct 2016 above in this thread)
#9012 (obvious overlaps with sklearn-pandas, but annotated as experimental?)
#3886 (superseded by #9012?)

@amueller
Member

This hinges critically on what you mean by "Can safely consume pandas DataFrame". If you mean a DataFrame containing only float numbers, we guarantee that everything will work. If there is even a single string anywhere, nothing will work.

I think any scikit-learn estimator returning a dataframe for any non-trivial (or maybe even trivial) operation is something that might never happen (though I would like it to).

@amueller
Member

amueller commented Jun 21, 2017

#9012 will happen and will become stable; the PR is a first iteration (or 10th iteration, if you count non-merged ones ;)
#6425 is likely to happen, though it is not entirely related to pandas.
#3886 is indeed superseded by #9012

@jnothman
Member

jnothman commented Jun 21, 2017 via email

@GuillaumeDesforges

GuillaumeDesforges commented Oct 24, 2019

Good numerics are hard to implement on pandas dataframes. They are just not meant for that, in particular for multivariate operations (numerical operations across columns).

Maybe I am missing something, but wouldn't it be possible to unwrap the DataFrame (using df.values), do the computations and then wrap the result back into a new DataFrame?

That is basically what I do manually between steps, and it is the only thing preventing the use of a Pipeline.

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 24, 2019 via email

@GuillaumeDesforges

GuillaumeDesforges commented Oct 24, 2019

In general no: it might not work (heterogeneous columns)

I think that column transformers and such can handle it individually.

it will lead to a lot of memory copies.

I understand that there are difficult design & implementation choices to make, and that is a sound argument.

However, I don't understand why you would argue that it is not a good idea to improve the way sklearn supports column metadata.

Allowing, for instance, to ingest a df with features, add a column via a predictor, do more data manipulation, run another predict, all within a Pipeline, would be useful because it would (for instance) allow hyperparameter optimization in a much more integrated and elegant way.

Doing it with pandas is just a suggestion, since it is the most common, easy and popular way to manipulate data, and I don't see any benefit in rewriting more than what they already did.

It would be up to the user to decide not to use this workflow when optimizing for performance.

@jnothman
Member

jnothman commented Oct 24, 2019 via email

@hermidalc
Contributor

I think the best solution would be to support pandas dataframes in and out for the sample and feature properties, and to properly pass and slice them into train and test in fit/transform. That would solve most use cases while keeping the speed of the data matrix X as numpy arrays.

@adrinjalali
Member

One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, such that np.array(pd.DataFrame(numpy_data)) will involve two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.

@hermidalc
Contributor

One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, such that np.array(pd.DataFrame(numpy_data)) will involve two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.

I hope I was clear in my previous post. I believe scikit-learn doesn't currently need to support pandas dataframes for X data; keep those as speedy numpy arrays. But what would solve many use cases is full support throughout the framework for pandas dataframes for metadata, i.e. sample properties and feature properties. This shouldn't be a performance burden even with memory copies, as these two data structures will be minor compared to X, and really only subsetting will be done on them.

@adrinjalali
Member

Yes, those changes do help in many use cases, and we're working on them. But this issue is beyond that: #5523 (comment)

@NicolasHug
Member

@hermidalc are you suggesting we let X be a numpy array, and put the metadata in other dataframe object(s)?

@hermidalc
Contributor

hermidalc commented Oct 25, 2019

@hermidalc are you suggesting we let X be a numpy array, and assign the meta data in other dataframe object(s)?

Yes, full support for sample properties and feature properties as pandas dataframes. Discussion is already happening on sample properties and feature names in other PRs and issues, e.g. #9566 and #14315.

@183amir

183amir commented Apr 28, 2020

I've read up on this issue and it looks like there are two major blockers here:

  1. Question: Guaranteed zero-copy round-trip from numpy? pandas-dev/pandas#27211
  2. That pandas does not handle N-D arrays.

Have you considered adding support for xarrays instead? They don't have those limitations of pandas.

import numpy as np
import xarray as xr

X = np.arange(10).reshape(5, 2)
assert np.asarray(xr.DataArray(X)) is X
assert np.asarray(xr.Dataset({"data": (("samples", "features"), X)}).data).base is X.base

There is a package called sklearn-xarray: https://phausamann.github.io/sklearn-xarray/content/wrappers.html that wraps scikit estimators to handle xarrays as input and output but that seems to have gone unmaintained for years. However, I wonder if wrappers are the way to go here.

@thomasjpfan
Member

xarray is actively being considered. It is being prototyped and worked on in #16772. There is a usage notebook in the PR showing what the API would look like.

(I will get back to it after we finish with the 0.23 release)

@gioxc88

gioxc88 commented Jun 12, 2020

I am also very interested in this feature. It would solve countless problems.
Currently this is the solution I am using: I wrote a wrapper around the sklearn.preprocessing module, which I called sklearn_wrapper.

So instead of importing from sklearn.preprocessing I import from sklearn_wrapper.
For example:

# this
from sklearn.preprocessing import StandardScaler 
# becomes 
from sklearn_wrapper import StandardScaler

Below is the implementation of this module. Try it out and let me know what you think.

from functools import wraps
from itertools import chain

import pandas as pd
from sklearn import preprocessing, compose, feature_selection, decomposition
from sklearn.compose._column_transformer import _get_transformer_list

modules = (preprocessing, feature_selection, decomposition)


def base_wrapper(Parent):
    class Wrapper(Parent):

        def transform(self, X, **kwargs):
            result = super().transform(X, **kwargs)
            check = self.check_out(X, result)
            return check if check is not None else result

        def fit_transform(self, X, y=None, **kwargs):
            result = super().fit_transform(X, y, **kwargs)
            check = self.check_out(X, result)
            return check if check is not None else result

        def check_out(self, X, result):
            if isinstance(X, pd.DataFrame):
                result = pd.DataFrame(result, index=X.index, columns=X.columns)
                result = result.astype(X.dtypes.to_dict())
            return result

        def __repr__(self):
            name = Parent.__name__
            tmp = super().__repr__().split('(')[1]
            return f'{name}({tmp}'

    Wrapper.__name__ = Parent.__name__
    Wrapper.__qualname__ = Parent.__name__

    return Wrapper


def base_pca_wrapper(Parent):
    Parent = base_wrapper(Parent)

    class Wrapper(Parent):
        @wraps(Parent)
        def __init__(self, *args, **kwargs):
            self._prefix_ = kwargs.pop('prefix', 'PCA')
            super().__init__(*args, **kwargs)

        def check_out(self, X, result):
            if isinstance(X, pd.DataFrame):
                columns = [f'{self._prefix_}_{i}' for i in range(1, (self.n_components or X.shape[1]) + 1)]
                result = pd.DataFrame(result, index=X.index, columns=columns)
            return result

    return Wrapper


class ColumnTransformer(base_wrapper(compose.ColumnTransformer)):

    def check_out(self, X, result):
        if isinstance(X, pd.DataFrame):
            return pd.DataFrame(result, index=X.index, columns=self._columns[0]) if self._remainder[1] == 'drop' \
                else pd.DataFrame(result, index=X.index, columns=X.columns). \
                astype(X.dtypes.iloc[self._remainder[-1]].to_dict())


class SelectKBest(base_wrapper(feature_selection.SelectKBest)):

    def check_out(self, X, result):
        if isinstance(X, pd.DataFrame):
            return pd.DataFrame(result, index=X.index, columns=X.columns[self.get_support()]). \
                astype(X.dtypes[self.get_support()].to_dict())


def make_column_transformer(*transformers, **kwargs):
    n_jobs = kwargs.pop('n_jobs', None)
    remainder = kwargs.pop('remainder', 'drop')
    sparse_threshold = kwargs.pop('sparse_threshold', 0.3)
    verbose = kwargs.pop('verbose', False)
    if kwargs:
        raise TypeError('Unknown keyword arguments: "{}"'
                        .format(list(kwargs.keys())[0]))
    transformer_list = _get_transformer_list(transformers)
    return ColumnTransformer(transformer_list, n_jobs=n_jobs,
                             remainder=remainder,
                             sparse_threshold=sparse_threshold,
                             verbose=verbose)


def __getattr__(name):
    if name not in __all__:
        return

    for module in modules:
        Parent = getattr(module, name, None)
        if Parent is not None:
            break

    if Parent is None:
        return

    if module is decomposition:
        Wrapper = base_pca_wrapper(Parent)
    else:
        Wrapper = base_wrapper(Parent)

    return Wrapper


__all__ = [*[c for c in preprocessing.__all__ if c[0].istitle()],
           *[c for c in decomposition.__all__ if c[0].istitle()],
           'SelectKBest']


def __dir__():
    tmp = dir()
    tmp.extend(__all__)
    return tmp

@samosun

samosun commented Dec 9, 2020

koaning/scikit-lego#304 provided another solution by hot-fixing sklearn.pipeline.FeatureUnion.

@premopie

premopie commented Mar 2, 2021

koaning/scikit-lego#304 provided another solution by hot-fixing sklearn.pipeline.FeatureUnion.

I like the solution with the pandify decorator - it's very elegant IMHO (provided that the columns remain unchanged during the transformation and there is no need to pickle the object). Thanks for letting us know!

@avm19
Contributor

avm19 commented Apr 5, 2022

Mentioning #23001 here, because this is the most popular issue on the topic.

@lorentzenchr
Member

I guess this can be closed with #23734.
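
For reference, a minimal sketch of the set_output API added by that work (scikit-learn >= 1.2): transformers and pipelines configured this way return DataFrames with feature names and the input's index.

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=[10, 20, 30])

pipe = make_pipeline(StandardScaler()).set_output(transform="pandas")
Xt = pipe.fit_transform(X)
print(type(Xt))             # <class 'pandas.core.frame.DataFrame'>
print(Xt.columns.tolist())  # ['a', 'b']
print(Xt.index.tolist())    # [10, 20, 30]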

@amueller
Member

@premopie @183amir @avm19 @gioxc88 @naught101 it would be awesome to get your feedback on the feature we implemented and whether it addresses your use-cases.

@avm19
Contributor

avm19 commented Oct 18, 2022

@premopie @183amir @avm19 @gioxc88 @naught101 it would be awesome to get your feedback on the feature we implemented and whether it addresses your use-cases.

I forgot what serious use case I had, but this feature works in a toy example. Here are the changes I made. So yeah, it made me a bit happier :)

@amueller
On a related note, I haven't figured out the state of affairs with predict* methods. Is there a mixin I should use in my custom non-transformer to automatically re-attach the input row index to the output? Is this being implemented, discussed, or shelved? I saw some mentions here and there, but couldn't put all the pieces together.
