
Pandas in, Pandas out? #5523

Open · naught101 opened this issue Oct 22, 2015 · 60 comments

@naught101 (Author) commented Oct 22, 2015

At the moment, it's possible to use a pandas dataframe as an input for most sklearn fit/predict/transform methods, but you get a numpy array out. It would be really nice to be able to get data out in the same format you put it in.

This isn't perfectly straightforward, because if your DataFrame contains columns that aren't numeric, then the intermediate numpy arrays will cause sklearn to fail, because they will be dtype=object instead of dtype=float. This can be solved by having a DataFrame->ndarray transformer that maps the non-numeric data to numeric data (e.g. integers representing classes/categories). sklearn-pandas already does this, although it currently doesn't have an inverse_transform, but that shouldn't be hard to add.

I feel like a transform like this would be really useful to have in sklearn - it's the kind of thing that anyone working with datasets with multiple data types would find useful. What would it take to get something like this into sklearn?
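
A minimal sketch of the kind of DataFrame-to-ndarray transformer described above (the class name and details are illustrative; this is not an existing sklearn or sklearn-pandas API):

import pandas as pd


class DataFrameToArray:
    """Illustrative sketch: encode non-numeric columns as integer codes so the
    output is a float ndarray, and keep enough state to invert the mapping.
    Assumes all values seen at transform time were also seen during fit."""

    def fit(self, X, y=None):
        self.columns_ = X.columns
        self.categories_ = {
            col: X[col].astype("category").cat.categories
            for col in X.columns
            if not pd.api.types.is_numeric_dtype(X[col])
        }
        return self

    def transform(self, X):
        out = X.copy()
        for col, cats in self.categories_.items():
            out[col] = pd.Categorical(out[col], categories=cats).codes
        return out.to_numpy(dtype=float)

    def inverse_transform(self, Xt):
        df = pd.DataFrame(Xt, columns=self.columns_)
        for col, cats in self.categories_.items():
            df[col] = cats[df[col].to_numpy(dtype=int)]
        return df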

@jnothman (Member) commented Oct 22, 2015

Scikit-learn was designed to work with a very generic input format. Perhaps the world around scikit-learn has changed a lot since then, in ways that make Pandas integration more important. It could still largely be supplied by third-party wrappers.

But apart from the broader question, I think you should try to give examples of how Pandas-friendly output from standard estimators will differ and make a difference to usability. Examples I can think of:

  • all methods could copy the index from the input
  • transformers should output appropriately-named columns
  • multiclass predict_proba can label columns with class names
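
For instance, the third point can already be emulated by hand today; built-in support would simply do this automatically (minimal sketch with made-up data):

import pandas as pd
from sklearn.linear_model import LogisticRegression

X_df = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]},
                    index=["a", "b", "c", "d"])
y = ["cat", "dog", "cat", "dog"]

clf = LogisticRegression().fit(X_df, y)
# index copied from the input, columns labelled with class names
proba = pd.DataFrame(clf.predict_proba(X_df), index=X_df.index, columns=clf.classes_)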

@naught101 (Author) commented Oct 22, 2015

Yep, off the top of my head:

  • the index can be really useful, e.g. for creating time-lagged variables (e.g. lag 1 day, on daily data with some missing days)
  • sklearn regressors could be used transparently with categorical data (pass a mixed dataframe, transform categorical columns with LabelBinarizer, then inverse_transform it back).
  • sklearn-pandas already provides a nice interface that allows you to pass a dataframe, and only use a subset of the data, and arbitrarily transform individual columns.

If this is all in a transform, then it doesn't really affect how sklearn works by default.
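
The LabelBinarizer round trip mentioned in the second point, as a standalone sketch:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

col = pd.Series(["cat", "dog", "cat", "bird"], name="animal")
lb = LabelBinarizer()
onehot = lb.fit_transform(col)            # (4, 3) indicator matrix, column order lb.classes_
roundtrip = lb.inverse_transform(onehot)  # back to ['cat', 'dog', 'cat', 'bird']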

@jnothman (Member) commented Oct 22, 2015

I don't think it can be implemented nicely as a transformer. It would be one or more metaestimators or mixins. I think they should be initially implemented externally and demonstrated as useful.


@amueller (Member) commented Oct 23, 2015

Making "pandas in" better was kind of the idea behind the column transformer PR #3886. Maybe I should have looked more closely into what sklearn-pandas is already doing. I'm not entirely sure what the best way forward is there.

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I can't find the issue where we discussed this right now. Maybe @jnothman remembers. I would really like that, though it would require major surgery on the input validation to preserve the column names :-/

Related #4196
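
What one has to do by hand today to keep column names through feature selection (illustrative sketch with made-up data):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X_df = pd.DataFrame(rng.rand(20, 4), columns=["a", "b", "c", "d"])
y = np.array([0, 1] * 10)

selector = SelectKBest(f_classif, k=2).fit(X_df, y)
# column names are lost by transform() and must be recovered via get_support()
selected = pd.DataFrame(selector.transform(X_df),
                        index=X_df.index,
                        columns=X_df.columns[selector.get_support()])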

@GaelVaroquaux (Member) commented Oct 23, 2015

@amueller (Member) commented Oct 23, 2015

True, but I think that would be nice ;)

One question is maybe whether we want this only in pipelines or everywhere. If we restrict it to pipelines, the input validation surgery would be smaller. But I'm not sure how useful it would be.

@kastnerkyle (Member) commented Oct 23, 2015

You can always do a pipeline with just one thing in it, right? So we kind of handle all cases (though it is hacky in the limit of 1 object) by restricting to just pipeline at first...

@sinhrks (Contributor) commented Oct 23, 2015

+1. Starting with pipeline sounds nice, and covering all transformers in a next step.

I also have an implementation of pandas and sklearn integration which can recover column info via inverse_transform (a dirty hack, though...)

http://pandas-ml.readthedocs.org/en/latest/sklearn.html

@GaelVaroquaux (Member) commented Oct 24, 2015

• the index can be really useful, e.g. for creating timed lagged variables
(e.g lag 1 day, on daily data with some missing days)

I am a bit stupid, but aren't we talking about something in the sample direction here, rather than the feature direction?

• sklearn regressors could be used transparently with categorical data (pass
mixed dataframe, transform categorical columns with LabelBinarizer, the
inverse_transform it back).

• sklearn-pandas already provides a nice interface that allows you to pass a
dataframe, and only use a subset of the data, and arbitrarily transform
individual columns.

OK, but that's all at the level of one transformer that takes Pandas in and gives a data matrix out, isn't it? Rather than attempting a modification of all the objects in scikit-learn (which is a risky endeavor), we could first implement this transformer (I believe that @amueller has this on his mind).

@naught101 (Author) commented Oct 24, 2015

sample direction here, rather than the feature direction?

Yep.

OK, but that's all at the level of one transformer that takes Pandas in, and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with a wrapper that dealt with X and y as dataframes. I don't see an obvious reason to screw with sklearn's internals.

@GaelVaroquaux (Member) commented Oct 24, 2015

OK, but that's all at the level of one transformer that takes Pandas in,
and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with
a wrapper that dealt with X and y as dataframes. I don't see an obvious reason
to screw with sklearn's internals.

Then we are on the same page. I do think that @amueller has ideas about
this, and we might see some discussion, and maybe code soon.

@jnothman (Member) commented Oct 24, 2015

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I don't find the issue where we discussed this right now.

#5172

@jnothman (Member) commented Nov 2, 2015

A note: I had wondered if one would only want to wrap the outermost estimator in an ensemble to provide this functionality to a user. I think the answer is: no, one wants to wrap atomic transformers too, to allow for dataframe-aware transformers within a pipeline (why not?). Without implementing this as a mixin, I think you're going to get issues with unnecessary parameter prefixing or else problems cloning (as in #5080).

@languitar commented Nov 27, 2015

👍

@dwyatte commented Jan 14, 2016

Just wanted to toss out the solution I am using:

def check_output(X, ensure_index=None, ensure_columns=None):
    """
    Joins X with ensure_index's index or ensure_columns's columns when available
    """
    if ensure_index is not None:
        if ensure_columns is not None:
            if type(ensure_index) is pd.DataFrame and type(ensure_columns) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index, columns=ensure_columns.columns)
        else:
            if type(ensure_index) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index)
    return X

I then create wrappers around sklearn's estimators that call this function on the output of transform, e.g.:

from sklearn.preprocessing import StandardScaler as _StandardScaler 
class StandardScaler(_StandardScaler):
    def transform(self, X):
        Xt = super(StandardScaler, self).transform(X)
        return check_output(Xt, ensure_index=X, ensure_columns=X)

Classifiers that need the index of the input dataframe X can just use it directly (useful for time series, as was pointed out).

This approach has the benefit of being completely compatible with the existing sklearn design while also preserving the speed of computation (math operations and indexing on dataframes are up to 10x slower than numpy arrays, http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Unfortunately, it's a lot of tedious work to add to each estimator that could utilize it.
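
Usage of the wrapped class would look like this (sketch; assumes check_output and the StandardScaler subclass defined above are in scope):

import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
                  index=pd.date_range("2016-01-01", periods=3))

scaler = StandardScaler()            # the wrapper class defined above
out = scaler.fit(df).transform(df)   # a DataFrame again, carrying df's index and columns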

@jnothman (Member) commented Jan 14, 2016

Maybe it's only necessary to make a Pipeline variant with this magic...
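
A minimal sketch of what such a Pipeline variant could look like (illustrative only; the class name is made up, it only re-wraps the final dense output, and it assumes the row count is preserved):

import pandas as pd
from sklearn.pipeline import Pipeline


class PandasPipeline(Pipeline):
    """Re-wrap transform output as a DataFrame when the input was one."""

    def _wrap(self, X, Xt):
        if isinstance(X, pd.DataFrame) and getattr(Xt, "ndim", 0) == 2 and Xt.shape[0] == X.shape[0]:
            return pd.DataFrame(Xt, index=X.index)
        return Xt

    def fit_transform(self, X, y=None, **fit_params):
        return self._wrap(X, super().fit_transform(X, y, **fit_params))

    def transform(self, X):
        return self._wrap(X, super().transform(X))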


@naught101 (Author) commented Jan 14, 2016

Or just something that wraps a pipeline/estimator, no?

I don't really understand why you'd call a function like that "check_*" when it's doing far more than just checking though...


@dwyatte commented Jan 15, 2016

I'm not sure if Pipeline is the right place to start, because column name inheritance is estimator-specific: e.g. scalers should inherit the column names of the input dataframe, whereas models like PCA should not. Feature selection estimators should inherit specific column names, but that is another problem, probably more related to #2007.

Is it always the case that n_rows of all arrays is preserved during transform? If so, just inheriting the index of the input (if it exists) seems safe. I'm not sure that getting a dataframe with default column names (e.g. [0, 1, 2, 3, ...]) is better than the current behavior from an end-user perspective, but if an explicit wrapper/meta-estimator is used, then at least the user will know what to expect.

Also, agreed that check_* is a poor name -- I was doing quite a bit more validation in my function, and just stripped out the dataframe logic to post here.

@amueller (Member) commented Jan 15, 2016

I think pipeline would be the place to start, though we would need to add something to all estimators that maps the column names appropriately.

@makmanalp commented Oct 7, 2016

"transformers should output appropriately-named columns" (@naught101)

"though it would require major surgery with the input validation to preserve the column names :-/" (@amueller)

"Not only input validation: every transform would have to describe what it does to the input columns." (@GaelVaroquaux)

Has anyone thought about the mechanics of how to pass the names around, from transformer to transformer, and perhaps how to track the provenance? Where would one store this?

A friend of mine, @cbrummitt, has a similar problem, where each column of his design matrix is a functional form (e.g. x^2, x^3, x_1^3x_2^2, represented as sympy expressions), and he has transformers, acting similarly to PolynomialFeatures, that can take in functional forms and generate more based on them. But he's using sympy to take the old expressions and generate new ones; storing the expressions as string labels doesn't cut it, and gets complicated when you layer the function transformations. He could do all this outside the pipeline, but then he doesn't get the benefit of GridSearch, etc.

I guess the more general version of our question is, how do you have some information that would be passed from transformer to transformer that is NOT the data itself? I can't come up with a great way without having pipeline-global state or having each transformer / estimator know about the previous ones, or having each step return multiple things, or something.

We then also came up with the idea to modify pipeline to keep track of this; you'd have to change _fit() and _transform() and perhaps a few other things. That seems like our best option.

This sounds crazy, but it feels like what we really want is for our data matrix to be sympy expressions, with each transformation generating new expressions. This is gross: check_array() stops it from happening, and it'd make other steps down the pipeline angry.

@amueller (Member) commented Oct 7, 2016

see #6425 for the current idea.

@jnothman (Member) commented Oct 8, 2016

All you want is a mapping, for each transformer (including a pipeline of transformers), from input feature names to output feature names (or some structured representation of the transformations, which I suspect is more engineering than we're going to get). That's what #6425 provides.
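
A toy illustration of that kind of mapping, where each transformer reports its output names given its input names and a pipeline could chain the mappings step by step (names are made up; this is not the #6425 API):

import numpy as np


class Log1pAugmenter:
    """Appends log1p-transformed copies of the input features."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.hstack([X, np.log1p(X)])

    def get_feature_names(self, input_features):
        return list(input_features) + ["log1p(%s)" % f for f in input_features]


names = ["age", "income"]
for step in (Log1pAugmenter(), Log1pAugmenter()):
    names = step.get_feature_names(names)
print(names)  # ['age', 'income', 'log1p(age)', 'log1p(income)', ...]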


@makmanalp commented Oct 11, 2016

We'll look into this, thank you!

@jimmywan (Contributor) commented Jun 21, 2017

Can someone provide a general update on the state of the world wrt this issue?

Will pandas DataFrame support always be a YMMV thing?
Guidance on what is/isn't considered safe for use with a pandas DataFrame instead of just an ndarray would be helpful. Perhaps something along the lines of the following (MADE UP EXAMPLE TABLE):

module/category              can safely consume pandas DataFrame?
sklearn.pipeline             SAFE
sklearn.feature_selection    SAFE
regressors                   YMMV
sklearn.feature_extraction   NOT SAFE, no plan to implement
etc.                         ...

Right now, I'm not sure of an approach other than "just try it and see if it throws exceptions".

We've tested a handful of hand-coded examples that seem to work just fine accepting a pandas DataFrame, but can't help thinking this will inevitably stop working right when we decide we need to make a seemingly trivial pipeline component swap... at which point everything falls down like a house of cards in a cryptic stack trace.

My initial thought process was to create a replacement pipeline object that can consume a pandas DataFrame and auto-generates wrappers for standard scikit-learn components to convert input/output DataFrame objects into numpy ndarrays as necessary. That way I can write my own custom Selectors/Transformers that make use of pandas DataFrame primitives. But that seems a bit heavy handed, especially if we're on the cusp of having "official" support for them.

I've been following a few different PRs, but it's hard to get a sense for which ones are abandoned and/or which reflect the current thinking:
Example:
#6425 (referenced Oct 2016 above in this thread)
#9012 (obvious overlaps with sklearn-pandas, but annotated as experimental?)
#3886 (superseded by #9012?)

@amueller (Member) commented Jun 21, 2017

This hinges critically on what you mean by "Can safely consume pandas DataFrame". If you mean a DataFrame containing only float numbers, we guarantee that everything will work. If there is even a single string anywhere, nothing will work.

I think any scikit-learn estimator returning a dataframe for any non-trivial (or maybe even trivial) operation is something that might never happen (though I would like it to).
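
The distinction in the first paragraph can be seen directly with check_array (sketch):

import pandas as pd
from sklearn.utils import check_array

ok = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
check_array(ok)       # fine: converted to a float ndarray

bad = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
check_array(bad)      # raises ValueError: could not convert string to float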

@amueller (Member) commented Jun 21, 2017

#9012 will happen and will become stable; the PR is a first iteration (or 10th iteration, if you count non-merged ones ;)
#6425 is likely to happen, though it is not entirely related to pandas.
#3886 is indeed superseded by #9012.

@jnothman (Member) commented Jun 21, 2017

@jnothman (Member) commented Oct 24, 2019

@GuillaumeDesforges commented Oct 24, 2019

Just checking the code briefly, there are all sorts of checks that happen at many places, using for instance

def check_X_y(X, y, accept_sparse=False, accept_large_sparse=True,

Plus, many operations use indexing in a numpy fashion that wouldn't be accepted by a pandas dataframe.

Keeping pandas in/out would be a must for day-to-day data science IMO, but scikit-learn seems to be designed in a way that would make it hard to implement.

@GaelVaroquaux (Member) commented Oct 24, 2019

@hermidalc (Contributor) commented Oct 24, 2019

Good numerics are hard to implement on pandas dataframes. They are just not meant for that, in particular for multivariate operations (numerical operations across columns). Machine learning is mostly multivariate numerics.

That decision should be left up to the user? In my experience of using scikit-learn extensively over the past two years, two core and important functionalities that are missing, and are a must-have for a lot of ML use cases, are support for passing sample metadata and feature metadata. Full pandas dataframe support is a natural and elegant way to deal with some of this.

These kinds of core functionalities are very important to keep the user base and bring in new users. Otherwise I see libraries like e.g. mlr3 eventually maturing and attracting users away from sklearn, because I know they do (or will) fully support data frames and metadata.

@GaelVaroquaux (Member) commented Oct 24, 2019

@hermidalc (Contributor) commented Oct 24, 2019

That decision should be left up to the user?
Well, the user is not implementing the algorithm.
Otherwise I see libraries like e.g. mlr3 eventually maturing and attracting users away from sklearn because I know they do (or will) fully support data frames and metadata.
mlr3 is in R, the dataframes are quite different from pandas dataframe. Maybe this makes it easier to implement. I agree that better support for feature names and heterogeneous data types is important. We are working on finding good technical solutions that do not lead to loss of performance and overly complicated code.

I think your approach of sticking with numpy arrays and at least supporting passing feature names, or even better multiple feature metadata, would work for many use cases. For passing training sample metadata you already support it in **fit_params, and I know there is an effort to improve the design. But I mentioned in scikit-learn/enhancement_proposals#16 that there are use cases where you would also need test sample metadata passed to transform methods, and this isn't currently supported.

@hermidalc (Contributor) commented Oct 24, 2019

mlr3 is in R, the dataframes are quite different from pandas dataframe.

Computational scientists in life sciences research are usually very comfortable with both python and R and use both together (myself included). I'm pretty sure a significant percentage of the scikit-learn user base are life sciences researchers.

Currently the available mature ML libraries in R IMHO don't even come close to scikit-learn in terms of providing a well-designed API and making the utilitarian parts of ML very straightforward (pipelines, hyperparameter search, scoring, etc.), whereas with these R libraries you have to code much of it yourself. But I see mlr3 as big future competition for scikit-learn, as they are designing it from the ground up the right way.

@GuillaumeDesforges commented Oct 24, 2019

Good numerics are hard to implement on pandas dataframes. They are just
not meant for that, in particular for multivariate operations (numerical
operations across columns).

Maybe I am missing something, but wouldn't it be possible to unwrap the DataFrame (using df.values), do the computations and then wrap back into a new DataFrame?

That is basically what I do manually between steps, and it is the only thing preventing me from using a Pipeline.
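
For concreteness, the manual unwrap/compute/wrap pattern being described (sketch):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

values = df.values                                # unwrap to a numpy array
scaled = StandardScaler().fit_transform(values)   # numeric work on the array
df_scaled = pd.DataFrame(scaled, index=df.index, columns=df.columns)  # wrap back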

@GaelVaroquaux (Member) commented Oct 24, 2019

@GuillaumeDesforges commented Oct 24, 2019

In general no: it might not work (heterogeneous columns)

I think that Column Transformers and such can handle it individually.

it will lead to a lot of memory copies.

I understand that there are difficult design & implementation choices to make, and that is a sound argument.

However, I don't understand why you would argue that it is not a good idea to improve the way sklearn supports column meta data.

Allowing, for instance, to ingest a df with features, add a column thanks to a predictor, do more data manipulations, do another predict, all that in a Pipeline, would be useful because it would (for instance) allow hyperparameter optimization in a much better integrated and elegant way.

Using pandas is just a suggestion, since it is the most common, easy and popular way to manipulate data, and I don't see any benefit in rewriting more than what they already did.

It would be up to the user to decide not to use this workflow when optimizing for performance.

@jnothman (Member) commented Oct 24, 2019

@hermidalc (Contributor) commented Oct 25, 2019

I think the best solution would be to support pandas dataframes in and out for the sample and feature properties, and to properly pass and slice them into train and test in fit/transform. That would solve most use cases while keeping the speed of the data matrix X as numpy arrays.
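
Part of that (keeping sample metadata aligned while splitting) can already be done today with train_test_split, which slices any number of aligned containers consistently; propagating such metadata through fit/transform inside a Pipeline is the part that needs framework support. Sketch with made-up data:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
sample_meta = pd.DataFrame({"batch": ["a", "a", "b", "b", "c", "c"]})

# the metadata frame is split with exactly the same row selection as X and y
X_tr, X_te, y_tr, y_te, meta_tr, meta_te = train_test_split(
    X, y, sample_meta, random_state=0)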

@adrinjalali (Member) commented Oct 25, 2019

One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, in a way that np.array(pd.DataFrame(numpy_data)) will have two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.
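
Whether (and where) copies happen can be probed directly; the outcome depends on the pandas version, dtypes and internal layout (sketch):

import numpy as np
import pandas as pd

numpy_data = np.arange(6, dtype=float).reshape(3, 2)
df = pd.DataFrame(numpy_data)

print(np.shares_memory(numpy_data, df.values))       # did building the frame copy?
print(np.shares_memory(df.values, np.asarray(df)))   # does converting back copy again?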

@hermidalc (Contributor) commented Oct 25, 2019

One important point missing from this arguments is that pandas is moving towards a columnar representation of the data, in a way that np.array(pd.DataFrame(numpy_data)) will have two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.

I hope I was clear in my previous post. I believe scikit-learn doesn't currently need to support pandas dataframes for X data; keep it as speedy numpy arrays. But what would solve many use cases is full support throughout the framework for pandas dataframes for metadata, i.e. sample properties and feature properties. This shouldn't be a performance burden, even with memory copies, as these two data structures will be minor compared to X and really only subsetting will be done on them.

@adrinjalali (Member) commented Oct 25, 2019

Yes, those changes do help in many use cases, and we're working on them. But this issue is beyond that: #5523 (comment)

@NicolasHug (Member) commented Oct 25, 2019

@hermidalc are you suggesting we let X be a numpy array, and assign the meta data in other dataframe object(s)?

@hermidalc (Contributor) commented Oct 25, 2019

@hermidalc are you suggesting we let X be a numpy array, and assign the meta data in other dataframe object(s)?

Yes, full support for sample properties and feature properties as pandas dataframes. Discussion is already happening on sample properties and feature names in other PRs and issues, e.g. here #9566 and #14315

@183amir commented Apr 28, 2020

I've read up on this issue and it looks like there are two major blockers here:

  1. pandas-dev/pandas#27211
  2. That pandas does not handle N-D arrays.

Have you considered adding support for xarrays instead? They don't have those limitations of pandas.

import numpy as np
import xarray as xr

X = np.arange(10).reshape(5, 2)
assert np.asarray(xr.DataArray(X)) is X
assert np.asarray(xr.Dataset({"data": (("samples", "features"), X)}).data).base is X.base

There is a package called sklearn-xarray (https://phausamann.github.io/sklearn-xarray/content/wrappers.html) that wraps scikit-learn estimators to handle xarrays as input and output, but it seems to have gone unmaintained for years. However, I wonder if wrappers are the way to go here.

@thomasjpfan (Member) commented Apr 28, 2020

xarray is actively being considered. It is being prototyped and worked on here: #16772. There is a usage notebook in the PR showing what the API would look like.

(I will get back to it after we finish with the 0.23 release)

@gioxc88 commented Jun 12, 2020

I am also very interested in this feature.
It would solve countless problems. Currently this is the solution I am using.
I wrote a wrapper around the sklearn.preprocessing module, which I called sklearn_wrapper.

So instead of importing from sklearn.preprocessing I import from sklearn_wrapper.
For example:

# this
from sklearn.preprocessing import StandardScaler 
# becomes 
from sklearn_wrapper import StandardScaler

Below is the implementation of this module. Try it out and let me know what you think.

from functools import wraps
from itertools import chain

import pandas as pd
from sklearn import preprocessing, compose, feature_selection, decomposition
from sklearn.compose._column_transformer import _get_transformer_list

modules = (preprocessing, feature_selection, decomposition)


def base_wrapper(Parent):
    class Wrapper(Parent):

        def transform(self, X, **kwargs):
            result = super().transform(X, **kwargs)
            check = self.check_out(X, result)
            return check if check is not None else result

        def fit_transform(self, X, y=None, **kwargs):
            result = super().fit_transform(X, y, **kwargs)
            check = self.check_out(X, result)
            return check if check is not None else result

        def check_out(self, X, result):
            if isinstance(X, pd.DataFrame):
                result = pd.DataFrame(result, index=X.index, columns=X.columns)
                result = result.astype(X.dtypes.to_dict())
            return result

        def __repr__(self):
            name = Parent.__name__
            tmp = super().__repr__().split('(')[1]
            return f'{name}({tmp}'

    Wrapper.__name__ = Parent.__name__
    Wrapper.__qualname__ = Parent.__name__

    return Wrapper


def base_pca_wrapper(Parent):
    Parent = base_wrapper(Parent)

    class Wrapper(Parent):
        @wraps(Parent)
        def __init__(self, *args, **kwargs):
            self._prefix_ = kwargs.pop('prefix', 'PCA')
            super().__init__(*args, **kwargs)

        def check_out(self, X, result):
            if isinstance(X, pd.DataFrame):
                columns = [f'{self._prefix_}_{i}' for i in range(1, (self.n_components or X.shape[1]) + 1)]
                result = pd.DataFrame(result, index=X.index, columns=columns)
            return result

    return Wrapper


class ColumnTransformer(base_wrapper(compose.ColumnTransformer)):

    def check_out(self, X, result):
        if isinstance(X, pd.DataFrame):
            return pd.DataFrame(result, index=X.index, columns=self._columns[0]) if self._remainder[1] == 'drop' \
                else pd.DataFrame(result, index=X.index, columns=X.columns). \
                astype(self.dtypes.iloc[self._remainder[-1]].to_dict())


class SelectKBest(base_wrapper(feature_selection.SelectKBest)):

    def check_out(self, X, result):
        if isinstance(X, pd.DataFrame):
            return pd.DataFrame(result, index=X.index, columns=X.columns[self.get_support()]). \
                astype(X.dtypes[self.get_support()].to_dict())


def make_column_transformer(*transformers, **kwargs):
    n_jobs = kwargs.pop('n_jobs', None)
    remainder = kwargs.pop('remainder', 'drop')
    sparse_threshold = kwargs.pop('sparse_threshold', 0.3)
    verbose = kwargs.pop('verbose', False)
    if kwargs:
        raise TypeError('Unknown keyword arguments: "{}"'
                        .format(list(kwargs.keys())[0]))
    transformer_list = _get_transformer_list(transformers)
    return ColumnTransformer(transformer_list, n_jobs=n_jobs,
                             remainder=remainder,
                             sparse_threshold=sparse_threshold,
                             verbose=verbose)


def __getattr__(name):
    if name not in __all__:
        return

    for module in modules:
        Parent = getattr(module, name, None)
        if Parent is not None:
            break

    if Parent is None:
        return

    if module is decomposition:
        Wrapper = base_pca_wrapper(Parent)
    else:
        Wrapper = base_wrapper(Parent)

    return Wrapper


__all__ = [*[c for c in preprocessing.__all__ if c[0].istitle()],
           *[c for c in decomposition.__all__ if c[0].istitle()],
           'SelectKBest']


def __dir__():
    tmp = dir()
    tmp.extend(__all__)
    return tmp
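
Assuming the module above is saved as sklearn_wrapper.py on the import path, usage would look like this (sketch):

import pandas as pd
from sklearn_wrapper import StandardScaler  # the wrapper module defined above

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
scaled = StandardScaler().fit_transform(df)
print(type(scaled))   # pandas DataFrame, carrying df's index, columns and dtypes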

@samosun commented Dec 9, 2020

koaning/scikit-lego#304 provided another solution by hot-fixing sklearn.pipeline.FeatureUnion.

@premopie commented Mar 2, 2021

koaning/scikit-lego#304 provided another solution by Hot-fixing on the sklearn.pipeline.FeatureUnion

I like the solution with the pandify decorator - it's very elegant IMHO (provided that the columns remain unchanged during the transformation and there is no need to pickle the object). Thanks for letting us know!
