Pandas in, Pandas out? #5523

Open
naught101 opened this issue Oct 22, 2015 · 55 comments



@naught101 naught101 commented Oct 22, 2015

At the moment, it's possible to use a pandas dataframe as an input for most sklearn fit/predict/transform methods, but you get a numpy array out. It would be really nice to be able to get data out in the same format you put it in.

This isn't perfectly straightforward, because if your DataFrame contains columns that aren't numeric, then the intermediate numpy arrays will cause sklearn to fail, because they will be dtype=object instead of dtype=float. This can be solved by having a DataFrame->ndarray transformer that maps the non-numeric data to numeric data (e.g. integers representing classes/categories). sklearn-pandas already does this, although it currently doesn't have an inverse_transform, but that shouldn't be hard to add.

I feel like a transform like this would be really useful to have in sklearn - it's the kind of thing that anyone working with datasets with multiple data types would find useful. What would it take to get something like this into sklearn?
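As a minimal sketch of the kind of DataFrame-to-ndarray transformer described above (FrameEncoder is a hypothetical name, not an existing sklearn or sklearn-pandas class; missing and unseen categories are not handled):

import pandas as pd


class FrameEncoder(object):
    """Hypothetical DataFrame <-> ndarray codec, illustration only.

    Non-numeric columns are replaced by integer category codes so the result
    is safe to feed into sklearn estimators; inverse_transform restores the
    original categories and column names.
    """

    def fit(self, X, y=None):
        self.columns_ = X.columns
        self.categories_ = {
            col: pd.Categorical(X[col]).categories
            for col in X.columns
            if X[col].dtype == object
        }
        return self

    def transform(self, X):
        out = X.copy()
        for col, cats in self.categories_.items():
            # replace category values by their integer codes
            out[col] = pd.Categorical(out[col], categories=cats).codes
        return out.values.astype(float)

    def inverse_transform(self, Xt):
        out = pd.DataFrame(Xt, columns=self.columns_)
        for col, cats in self.categories_.items():
            # map integer codes back to the original category values
            out[col] = cats[out[col].astype(int).values]
        return out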


@jnothman jnothman commented Oct 22, 2015

Scikit-learn was designed to work with a very generic input format. Perhaps the world around scikit-learn has changed a lot since then, in ways that make Pandas integration more important. It could still largely be supplied by third-party wrappers.

But apart from the broader question, I think you should try to give examples of how Pandas-friendly output from standard estimators will differ and make a difference to usability. Examples I can think of (a rough sketch follows the list):

  • all methods could copy the index from the input
  • transformers should output appropriately-named columns
  • multiclass predict_proba can label columns with class names
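For the first and third bullets, a rough sketch of what such a third-party wrapper might look like (PandasWrapper is a hypothetical name; it assumes a classifier exposing the usual classes_ attribute):

import pandas as pd
from sklearn.base import clone


class PandasWrapper(object):  # hypothetical meta-estimator, illustration only
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        # copy the index from the input
        return pd.Series(self.estimator_.predict(X), index=X.index)

    def predict_proba(self, X):
        # label columns with class names
        return pd.DataFrame(self.estimator_.predict_proba(X),
                            index=X.index, columns=self.estimator_.classes_)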

@naught101 naught101 commented Oct 22, 2015

Yep, off the top of my head:

  • the index can be really useful, e.g. for creating time-lagged variables (e.g. a lag of 1 day, on daily data with some missing days); see the sketch at the end of this comment
  • sklearn regressors could be used transparently with categorical data (pass a mixed dataframe, transform categorical columns with LabelBinarizer, then inverse_transform it back).
  • sklearn-pandas already provides a nice interface that allows you to pass a dataframe, only use a subset of the data, and arbitrarily transform individual columns.

If this is all in a transform, then it doesn't really affect how sklearn works by default.
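To make the first bullet concrete, a small pandas-only example (illustrative data) of why index alignment matters for lagged features:

import pandas as pd

# daily series with a missing day (2015-10-03 is absent)
s = pd.Series([1.0, 2.0, 4.0],
              index=pd.to_datetime(["2015-10-01", "2015-10-02", "2015-10-04"]),
              name="x")

# shift the timestamps forward by one day, then align on the index
lag1 = s.copy()
lag1.index = lag1.index + pd.Timedelta(days=1)
df = pd.concat([s, lag1.rename("x_lag1")], axis=1)
# x_lag1 for 2015-10-04 is NaN because 2015-10-03 is missing, which a
# positional shift on a bare ndarray would silently get wrong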


@jnothman jnothman commented Oct 22, 2015

I don't think it can be implemented nicely as a transformer. It would be one or more metaestimators or mixins. I think they should be initially implemented externally and demonstrated as useful.



@amueller amueller commented Oct 23, 2015

Making "pandas in" better was kind of the idea behind the column transformer PR #3886. Maybe I should have looked more closely into what sklearn-pandas is already doing. I'm not entirely sure what the best way forward is there.

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I can't find the issue where we discussed this right now. Maybe @jnothman remembers. I would really like that, though it would require major surgery with the input validation to preserve the column names :-/

Related #4196


@GaelVaroquaux GaelVaroquaux commented Oct 23, 2015


@amueller amueller commented Oct 23, 2015

True, but that I think would be nice ;)

One question is maybe whether we want this only in pipelines or everywhere. If we restrict it to pipelines, the input validation surgery would be less big. But I'm not sure how useful it would be.


@kastnerkyle kastnerkyle commented Oct 23, 2015

You can always do a pipeline with just one thing in it, right? So we kind of handle all cases (though it is hacky in the limit of 1 object) by restricting to just pipeline at first...


@sinhrks sinhrks commented Oct 23, 2015

+1. Starting with pipeline sounds nice, and covering all transformers in a next step.

I also have an implementation of pandas and sklearn integration, which can restore column info via inverse_transform (dirty hack though...)

http://pandas-ml.readthedocs.org/en/latest/sklearn.html


@GaelVaroquaux GaelVaroquaux commented Oct 24, 2015

• the index can be really useful, e.g. for creating timed lagged variables
(e.g lag 1 day, on daily data with some missing days)

I am a bit stupid, but aren't we talking about something in the sample direction here, rather than the feature direction?

• sklearn regressors could be used transparently with categorical data (pass
mixed dataframe, transform categorical columns with LabelBinarizer, the
inverse_transform it back).

• sklearn-pandas already provides a nice interface that allows you to pass a
dataframe, and only use a subset of the data, and arbitrarily transform
individual columns.

OK, but that's all at the level of one transformer that takes Pandas in, and gives a data matrix out, isn't it? Rather than attempting a modification on all the objects of scikit-learn (which is a risky endeavour), we could first implement this transformer (I believe that @amueller has this in mind).


@naught101 naught101 commented Oct 24, 2015

sample direction here, rather than the feature direction?

Yep.

OK, but that's all at the level of one transformer that takes Pandas in, and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with a wrapper that dealt with X and y as dataframes. I don't see an obvious reason to screw with sklearn's internals.


@GaelVaroquaux GaelVaroquaux commented Oct 24, 2015

OK, but that's all at the level of one transformer that takes Pandas in,
and gives a data matrix out, isn't it?

Yep, that's what I was thinking to start with. I would be more than happy with
a wrapper that dealt with X and y as dataframes. I don't see an obvious reason
to screw with sklearn's internals.

Then we are on the same page. I do think that @amueller has ideas about
this, and we might see some discussion, and maybe code soon.


@jnothman jnothman commented Oct 24, 2015

The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I don't find the issue where we discussed this right now.

#5172


@jnothman jnothman commented Nov 2, 2015

A note: I had wondered if one would only want to wrap the outermost estimator in an ensemble to provide this functionality to a user. I think the answer is: no, one wants to wrap atomic transformers too, to allow for dataframe-aware transformers within a pipeline (why not?). Without implementing this as a mixin, I think you're going to get issues with unnecessary parameter prefixing or else problems cloning (as in #5080).


@languitar languitar commented Nov 27, 2015

👍


@dwyatte dwyatte commented Jan 14, 2016

Just wanted to toss out the solution I am using:

import pandas as pd


def check_output(X, ensure_index=None, ensure_columns=None):
    """
    Joins X with ensure_index's index and/or ensure_columns's columns when available
    """
    if ensure_index is not None:
        if ensure_columns is not None:
            # both references are DataFrames: restore index and column names
            if isinstance(ensure_index, pd.DataFrame) and isinstance(ensure_columns, pd.DataFrame):
                X = pd.DataFrame(X, index=ensure_index.index, columns=ensure_columns.columns)
        else:
            # only an index reference was given: restore the index alone
            if isinstance(ensure_index, pd.DataFrame):
                X = pd.DataFrame(X, index=ensure_index.index)
    return X

I then create wrappers around sklearn's estimators that call this function on the output of transform e.g.,

from sklearn.preprocessing import StandardScaler as _StandardScaler


class StandardScaler(_StandardScaler):
    def transform(self, X):
        # delegate to sklearn, then re-attach the input's index and columns
        Xt = super(StandardScaler, self).transform(X)
        return check_output(Xt, ensure_index=X, ensure_columns=X)

Classifiers that need the index of the input dataframe X can just use it (useful for timeseries, as was pointed out).

This approach has the benefit of being completely compatible with the existing sklearn design while also preserving the speed of computation (math operations and indexing on dataframes are up to 10x slower than numpy arrays, http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Unfortunately, it's a lot of tedious work to add to each estimator that could utilize it.
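Assuming the wrappers above, usage would look roughly like this (illustrative data only):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"],
                  index=pd.date_range("2016-01-01", periods=5))
scaled = StandardScaler().fit(df).transform(df)  # the wrapped class above
# scaled is a DataFrame carrying df's DatetimeIndex and column names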


@jnothman jnothman commented Jan 14, 2016

Maybe it's only necessary to make a Pipeline variant with this magic...



@naught101 naught101 commented Jan 14, 2016

Or just something that wraps a pipeline/estimator, no?

I don't really understand why you'd call a function like that "check_*" when it's doing far more than just checking though...



@dwyatte dwyatte commented Jan 15, 2016

I'm not sure if Pipeline is the right place to start, because all column name inheritance is estimator-specific: e.g., scalers should inherit the column names of the input dataframe, whereas models like PCA should not. Feature selection estimators should inherit specific column names, but that is another problem, probably more related to #2007.

Is it always the case that n_rows of all arrays is preserved during transform? If so, just inheriting the index of the input (if it exists) seems safe. I'm not sure that getting a dataframe with default column names (e.g., [0, 1, 2, 3, ...]) is better than the current behavior from an end-user perspective, but if an explicit wrapper/meta-estimator is used, then at least the user will know what to expect.

Also, agreed that check_* is a poor name -- I was doing quite a bit more validation in my function, and just stripped out the dataframe logic to post here.


@amueller amueller commented Jan 15, 2016

I think pipeline would be the place to start, though we would need to add something to all estimators that map the column names appropriately.


@makmanalp makmanalp commented Oct 7, 2016

transformers should output appropriately-named columns @naught101

though it would require major surgery with the input validation to preserve the column names :-/ @amueller

Not only input validation: every transform would have to describe what it does to the input columns. @GaelVaroquaux

Has anyone thought about the mechanics of how to pass the names around, from transformer to transformer, and perhaps how to track the provenance? Where would one store this?

A friend of mine, @cbrummitt, has a similar problem, where each column of his design matrix is a functional form (e.g. x^2, x^3, x_1^3x_2^2, represented as sympy expressions), and he has transformers that act similarly to PolynomialFeatures and can take in functional forms and generate more based on them. But he's using sympy to take the old expressions and generate new ones, so storing the expressions as string labels doesn't cut it, and it gets complicated when you layer the function transformations. He could do all this outside the pipeline, but then he doesn't get the benefit of GridSearch, etc.

I guess the more general version of our question is, how do you have some information that would be passed from transformer to transformer that is NOT the data itself? I can't come up with a great way without having pipeline-global state or having each transformer / estimator know about the previous ones, or having each step return multiple things, or something.

We then also came up with the idea of modifying pipeline to keep track of this; you'd have to change _fit() and _transform() and perhaps a few other things. That seems like our best option.

This sounds crazy, but what it feels like is that we really want our data matrix to be sympy expressions, with each transformation generating new expressions. This is gross, check_array() stops it from happening, and it'd make other steps down the pipeline angry.


@amueller amueller commented Oct 7, 2016

see #6425 for the current idea.


@jnothman jnothman commented Oct 8, 2016

All you want is a mapping, for each transformer (including a pipeline of
transformers), from input feature names to output feature names (or some
structured representation of the transformations, which I suspect is more
engineering than we're going to get). That's what #6425 provides.



@makmanalp makmanalp commented Oct 11, 2016

We'll look into this, thank you!


@jimmywan jimmywan commented Jun 21, 2017

Can someone provide a general update on the state of the world wrt this issue?

Will pandas DataFrame support always be a YMMV thing?
Guidance on what is/isn't considered safe for use with a pandas DataFrame instead of just an ndarray would be helpful. Perhaps something along the lines of the following (MADE UP EXAMPLE TABLE):

module/category               can safely consume pandas DataFrame?
sklearn.pipeline              SAFE
sklearn.feature_selection     SAFE
regressors                    YMMV
sklearn.feature_extraction    NOT SAFE, no plan to implement
etc.                          ...

Right now, I'm not sure of an approach other than "just try it and see if it throws exceptions".

We've tested a handful of hand-coded examples that seem to work just fine accepting a pandas DataFrame, but can't help thinking this will inevitably stop working right when we decide we need to make a seemingly trivial pipeline component swap... at which point everything falls down like a house of cards in a cryptic stack trace.

My initial thought process was to create a replacement pipeline object that can consume a pandas DataFrame, with auto-generated wrappers for standard scikit-learn components that convert input/output DataFrame objects into numpy ndarray objects as necessary. That way I can write my own custom Selectors/Transformers that make use of pandas DataFrame primitives, but that seems a bit heavy handed, especially so if we're on the cusp of having "official" support for them.
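A rough sketch of that wrapper idea (DataFrameTransformer is a hypothetical name; it wraps individual steps rather than replacing Pipeline, only keeps column names when the number of columns is unchanged, and the cloning/parameter-prefixing caveats discussed earlier in this thread still apply):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameTransformer(BaseEstimator, TransformerMixin):
    """Delegate to a plain sklearn transformer, then re-attach the input's
    index (and its column names when the shape allows it). Sparse or
    non-DataFrame cases pass through unchanged."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        Xt = self.transformer.transform(X)
        if not (isinstance(X, pd.DataFrame) and isinstance(Xt, np.ndarray)):
            return Xt
        columns = X.columns if Xt.shape[1] == X.shape[1] else None
        return pd.DataFrame(Xt, index=X.index, columns=columns)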

I've been following a few different PRs, but it's hard to get a sense for which ones are abandoned and/or which reflect the current thinking:
Example:
#6425 (referenced Oct 2016 above in this thread)
#9012 (obvious overlaps with sklearn-pandas, but annotated as experimental?)
#3886 (superseded by #9012 ?)


@amueller amueller commented Jun 21, 2017

This hinges critically on what you mean by "Can safely consume pandas DataFrame". If you mean a DataFrame containing only float numbers, we guarantee that everything will work. If there is even a single string anywhere, nothing will work.

I think any scikit-learn estimator returning a dataframe for any non-trivial (or maybe even trivial) operation is something that might never happen (though I would like it to).


@amueller amueller commented Jun 21, 2017

#9012 will happen and will become stable, the PR is a first iteration (or 10th iteration, if you count non-merged ones ;)
#6425 is likely to happen, though it is not entirely related to pandas.
#3886 is indeed superseded by #9012


@jnothman jnothman commented Jun 21, 2017


@sam-s sam-s commented Aug 7, 2018

All my transformers return DataFrames when given DataFrames.
When I input a 300-column DataFrame into a Pipeline and receive a 500-column ndarray, I cannot effectively learn much from it via, e.g., feature_selection, because I do not have the column names anymore. If, say, mutual_info_classif tells me that only columns 30 and 75 are important, I cannot figure out how to simplify my original Pipeline for production.
Thus it is critical for my use case to keep my data in a DataFrame.
Thank you.
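For the feature-selection part of this use case, the column names can at least be carried along manually today; a sketch, assuming X is a DataFrame and y the target:

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
selected_columns = X.columns[selector.get_support()]   # names survive here
X_selected = pd.DataFrame(selector.transform(X),
                          index=X.index, columns=selected_columns)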


@amueller amueller commented Jun 26, 2019

@sam-s I totally agree. In the "short" term, this will be addressed by #13307 and scikit-learn/enhancement_proposals#18

You won't get a pandas dataframe, but you'll get the column names to create one.

Can you please give a more concrete example, though? Because if all transformers return DataFrames, things should work (or be made to work more easily than the proposals above).


@amueller amueller commented Jul 3, 2019

Slight update via pandas-dev/pandas#27211, which puts a damper on my hopes. It looks like we cannot trust there to be a zero-copy round-trip, so wrapping and unwrapping into pandas will result in substantial cost.


@adrinjalali adrinjalali commented Jul 5, 2019

Slight update via pandas-dev/pandas#27211 which puts a damper on my hopes. It looks like we can not trust there to be a zero-copy round-trip, and so wrapping and unwrapping into pandas will result in substantial cost.

yeah, but I guess once we cover the feature and sample props (row names and "indices" being kind of a sample prop), most related use cases which kinda need pandas now would be covered, right?


@amueller amueller commented Jul 5, 2019

@adrinjalali I'm not sure what you mean by "most related use cases which kinda need pandas". I saw this issue not primarily as supporting pandas to implement features within scikit-learn, but to have scikit-learn integrate more easily into a pandas-based workflow.


@janosh janosh commented Jul 12, 2019

Just out of curiosity, is there a timeframe within which improved Pandas compatibility is expected to land? I'm specifically interested in Pandas in -> Pandas out for StandardScaler.

@adrinjalali adrinjalali added this to To do in Pandas Oct 21, 2019

@hermidalc hermidalc commented Oct 24, 2019

I have a use case where I need pandas dataframes preserved through each step in a Pipeline. For example a pipeline with 1) feature selection step filtering features based on data, 2) data transformation step, 3) another feature selection step to filter for specific feature column names or original indices, 4) standardization, 5) classification.

Step 3) I believe is currently not possible in sklearn, even with a numpy array input, because original feature indices are meaningless when the data gets to 3), since in 1) there was a feature selection step. If pandas dataframes were being preserved in the pipeline it would work, because I could filter by column name in 3).

Am I wrong in thinking there is currently no way to do this even with numpy array input?
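For reference, step 3) is easy to express if (and only if) DataFrames survived the earlier steps, which stock sklearn transformers do not provide today; a hypothetical sketch:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the named columns of a DataFrame passed through a pipeline."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # requires that the previous steps hand a DataFrame through
        return X[self.columns]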


@adrinjalali adrinjalali commented Oct 24, 2019

You're right that it's not supported, and supporting it would not be trivial. Related to your usecase, we're working on passing feature names along the pipeline (as you see in the linked PRs and proposals above). That should hopefully help with your case once it's done. I'm not sure if it helps, but you could also have a look at https://github.com/scikit-learn-contrib/sklearn-pandas


@hermidalc hermidalc commented Oct 24, 2019

You're right that it's not supported, and supporting it would not be trivial. Related to your usecase, we're working on passing feature names along the pipeline (as you see in the linked PRs and proposals above). That should hopefully help with your case once it's done.

Thanks for confirmation, yes being able to pass around feature names (or other feature properties) to fit methods and have them properly sliced during each feature selection step would be fine for this use case.

I'm not sure if it helps, but you could also have a look at https://github.com/scikit-learn-contrib/sklearn-pandas

Earlier I read through their docs, and maybe I'm not seeing it, but most (or all) of their features seem obsolete now in scikit-learn 0.21 with sklearn.compose.ColumnTransformer? Also, it doesn't seem that they support pandas out; it looks like numpy arrays come out after transforms.


@jnothman jnothman commented Oct 24, 2019


@GuillaumeDesforges GuillaumeDesforges commented Oct 24, 2019

Just briefly checking the code, there are all sorts of checks that happen arbitrarily in many places, using for instance

def check_X_y(X, y, accept_sparse=False, accept_large_sparse=True,

Plus, many operations use indexing in a numpy fashion that wouldn't be accepted by a pandas dataframe.

Keeping pandas in/out would be a must for day-to-day data science IMO, but scikit-learn seems to be designed in a way that would make it hard to implement.


@GaelVaroquaux GaelVaroquaux commented Oct 24, 2019


@hermidalc hermidalc commented Oct 24, 2019

Good numerics are hard to implement on pandas dataframes. They are just not meant for that, in particular for multivariate operations (numerical operations across columns). Machine learning is mostly multivariate numerics.

That decision should be left up to the user? In my experience using scikit-learn extensively over the past two years, I think two core and important functionalities that are missing, and are a must-have for a lot of ML use cases, are support for passing sample and feature metadata. Full pandas dataframe support is a natural and elegant way to deal with some of this.

These kind of core functionalities are very important to keep the user base and bring in new users. Otherwise I see libraries like e.g. mlr3 eventually maturing and attracting users away from sklearn because I know they do (or will) fully support data frames and metadata.


@GaelVaroquaux GaelVaroquaux commented Oct 24, 2019


@hermidalc hermidalc commented Oct 24, 2019

That decision should be left up to the user?

Well, the user is not implementing the algorithm.

Otherwise I see libraries like e.g. mlr3 eventually maturing and attracting users away from sklearn because I know they do (or will) fully support data frames and metadata.

mlr3 is in R, the dataframes are quite different from pandas dataframe. Maybe this makes it easier to implement. I agree that better support for feature names and heterogeneous data types is important. We are working on finding good technical solutions that do not lead to loss of performance and overly complicated code.

I think your approach of sticking with numpy arrays and at least supporting passing feature names or even better multiple feature metadata would work for many use cases. For passing training sample metadata you already support it in **fit_params and I know there is an effort to improve design. But I mentioned in scikit-learn/enhancement_proposals#16 that there are use cases where you would also need test sample metadata passed to transform methods and this isn't currently supported.


@hermidalc hermidalc commented Oct 24, 2019

mlr3 is in R, the dataframes are quite different from pandas dataframe.

Computational scientists in life sciences research are usually very comfortable with both python and R and use both together (myself included). I'm pretty sure a significant percentage of the scikit-learn user base are life sciences researchers.

Currently the available mature ML libraries in R IMHO don't even come close to scikit-learn in terms of providing a well-designed API and making the utilitarian parts of ML very straightforward (pipelines, hyperparameter search, scoring, etc), whereas in R with these libraries you have to code it pretty much yourself. But I see mlr3 as future big competition for scikit-learn, as they are designing it from the ground up the right way.


@GuillaumeDesforges GuillaumeDesforges commented Oct 24, 2019

Good numerics are hard to implement on pandas dataframes. They are just
not meant for that, in particular for multivariate operations (numerical
operations across columns).

Maybe I am missing something, but wouldn't it be possible to unwrap the DataFrame (using df.values), do the computations, and then wrap the result back into a new DataFrame?

That is basically what I do manually between steps, and it is the only thing preventing the use of a Pipeline.


@GaelVaroquaux GaelVaroquaux commented Oct 24, 2019


@GuillaumeDesforges GuillaumeDesforges commented Oct 24, 2019

In general no: it might not work (heterogeneous columns)

I think that Column Transformers and such can handle it individually.

it will lead to a lot of memory copies.

I understand that there are difficult design & implementation choices to make, and that is a sound argument.

However, I don't understand why you would argue that it is not a good idea to improve the way sklearn supports column meta data.

Allowing, for instance, to ingest a df with features, add a column thanks to a predictor, do more data manipulations, do another predict, all of that in a Pipeline, is something that would be useful because it would (for instance) allow hyperparameter optimization in a much better integrated and more elegant way.

Doing it with or without pandas is just a suggestion, since pandas is the most common, easy and popular way to manipulate data, and I don't see any benefit in rewriting more than what they did.

It would be up to the user to decide not to use this workflow when optimizing performances.


@jnothman jnothman commented Oct 24, 2019


@hermidalc hermidalc commented Oct 25, 2019

I think the best solution would be to support pandas dataframes in and out for the sample and feature properties, and to properly pass and slice them into train and test for fit/transform. That would solve most use cases while keeping the speed of the data matrix X as numpy arrays.


@adrinjalali adrinjalali commented Oct 25, 2019

One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, in a way that np.array(pd.DataFrame(numpy_data)) will have two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.
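One way to see whether a given pandas/numpy combination actually copies on the round-trip (a diagnostic sketch, not a guarantee either way):

import numpy as np
import pandas as pd

arr = np.random.randn(1000, 10)
df = pd.DataFrame(arr)

# Whether either direction copies depends on the pandas version and dtype
# layout; with a single float block sharing has historically been possible,
# while a fully columnar layout would force copies both ways.
print(np.shares_memory(arr, df.to_numpy()))
print(np.shares_memory(arr, np.asarray(df)))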


@hermidalc hermidalc commented Oct 25, 2019

One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, in a way that np.array(pd.DataFrame(numpy_data)) will have two guaranteed memory copies. That's why it's not as easy as just keeping the dataframe and using values whenever we need speed.

I hope I was clear in my previous post. I believe scikit-learn doesn't currently need to support pandas dataframes for X data; keep X as speedy numpy arrays. But what would solve many use cases is full support throughout the framework for pandas dataframes for metadata, i.e. sample properties and feature properties. This shouldn't be a performance burden, even with memory copies, as these two data structures will be minor compared to X, and really only subsetting will be done on them.
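As a toy illustration of that split (hypothetical metadata layout, X itself staying a numpy array):

import numpy as np
import pandas as pd

# X stays a plain ndarray; feature properties live in a small DataFrame
# indexed by feature name
feature_meta = pd.DataFrame(
    {"dtype": ["numeric", "numeric", "categorical"],
     "source": ["lab", "lab", "questionnaire"]},
    index=["age", "weight", "smoker"])

support = np.array([True, False, True])        # e.g. from selector.get_support()
feature_meta_selected = feature_meta[support]  # metadata follows the selection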


@adrinjalali adrinjalali commented Oct 25, 2019

Yes, those changes do help in many use cases, and we're working on them. But this issue is beyond that: #5523 (comment)


@NicolasHug NicolasHug commented Oct 25, 2019

@hermidalc are you suggesting we let X be a numpy array, and assign the meta data in other dataframe object(s)?


@hermidalc hermidalc commented Oct 25, 2019

@hermidalc are you suggesting we let X be a numpy array, and assign the meta data in other dataframe object(s)?

Yes, full support for sample properties and feature properties as pandas dataframes. Discussion is already happening on sample properties and feature names in other PRs and issues, e.g. here #9566 and #14315
