Pandas in, Pandas out? #5523
Scikit-learn was designed to work with a very generic input format. Perhaps the world around scikit-learn has changed a lot since then, in ways that make Pandas integration more important. It could still largely be supplied by third-party wrappers. But apart from the broader question, I think you should try to give examples of how Pandas-friendly output from standard estimators will differ and make a difference to usability. Examples I can think of:
Yep, off the top of my head:
If this is all in a transform, then it doesn't really affect how sklearn works by default.
I don't think it can be implemented nicely as a transformer. It would be …
Making "pandas in" better was kind of the idea behind the column transformer PR #3886. Maybe I should have looked more closely into what sklearn-pandas is already doing. I'm not entirely sure what the best way forward is there. The other thing that would be nice would be preserving column names in transformations / selecting them when doing feature selection. I can't find the issue where we discussed this right now. Maybe @jnothman remembers. I would really like that, though it would require major surgery with the input validation to preserve the column names :-/ Related #4196
> though it would require major surgery with the input validation to preserve the column names :-/

Not only input validation: every transform would have to describe what it does to the input columns.
True, but I think that would be nice ;) One question is maybe whether we want this only in pipelines or everywhere. If we restrict it to pipelines, the input validation surgery would be smaller. But I'm not sure how useful it would be.
You can always do a pipeline with just one thing in it, right? So we kind of handle all cases (though it is hacky in the limit of 1 object) by restricting to just pipeline at first...
+1. Starting with pipeline sounds nice, and covering all transformers in a next step. I also have an implementation with pandas and sklearn integration, which can revert column info via …
I am a bit stupid, but aren't we talking about something in the sample …

OK, but that's all at the level of one transformer that takes Pandas in, …
Yep.
Yep, that's what I was thinking to start with. I would be more than happy with a wrapper that dealt with …
Then we are on the same page. I do think that @amueller has ideas about …
A note: I had wondered if one would only want to wrap the outermost estimator in an ensemble to provide this functionality to a user. I think the answer is: no, one wants to wrap atomic transformers too, to allow for dataframe-aware transformers within a pipeline (why not?). Without implementing this as a mixin, I think you're going to get issues with unnecessary parameter prefixing or else problems cloning (as in #5080).
👍
Just wanted to toss out the solution I am using:

```python
import pandas as pd

def check_output(X, ensure_index=None, ensure_columns=None):
    """
    Joins X with ensure_index's index or ensure_columns's columns when available
    """
    if ensure_index is not None:
        if ensure_columns is not None:
            if type(ensure_index) is pd.DataFrame and type(ensure_columns) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index, columns=ensure_columns.columns)
        else:
            if type(ensure_index) is pd.DataFrame:
                X = pd.DataFrame(X, index=ensure_index.index)
    return X
```

I then create wrappers around sklearn's estimators that call this function on the output of transform, e.g.:

```python
from sklearn.preprocessing import StandardScaler as _StandardScaler

class StandardScaler(_StandardScaler):
    def transform(self, X):
        Xt = super(StandardScaler, self).transform(X)
        return check_output(Xt, ensure_index=X, ensure_columns=X)
```

Classifiers that need the index of the input dataframe X can just use it (useful for timeseries, as was pointed out). This approach has the benefit of being completely compatible with the existing sklearn design while also preserving the speed of computation (math operations and indexing on dataframes are up to 10x slower than numpy arrays, http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/). Unfortunately, it's a lot of tedious work to add to each estimator that could utilize it.
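With this kind of wrapper, a DataFrame in gives a DataFrame out with the same index and column names. A small usage sketch, assuming the `check_output` and wrapped `StandardScaler` definitions above are in scope:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
                  index=["r1", "r2", "r3"])

# fit on the DataFrame, transform, and get a DataFrame back
out = StandardScaler().fit(df).transform(df)
print(out.index.tolist())    # ['r1', 'r2', 'r3']
print(out.columns.tolist())  # ['a', 'b']
```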
Maybe it's only necessary to make a Pipeline variant with this magic...
Or just something that wraps a pipeline/estimator, no? I don't really understand why you'd call a function like that "check_*" when it's doing far more than just checking, though...
I'm not sure if Pipeline is the right place to start, because all column name inheritance is estimator-specific, e.g. scalers should inherit the column names of the input dataframe whereas models like PCA should not. Feature selection estimators should inherit specific column names, but that is another problem, probably more related to #2007. Is it always the case that n_rows of all arrays is preserved during transform? If so, just inheriting the index of the input (if it exists) seems safe, but I'm not sure that getting a dataframe with default column names (e.g., [0, 1, 2, 3, ...]) is better than the current behavior from an end-user perspective; if an explicit wrapper/meta-estimator is used, though, at least the user will know what to expect. Also, agreed that check_* is a poor name -- I was doing quite a bit more validation in my function, and just stripped out the dataframe logic to post here.
I think pipeline would be the place to start, though we would need to add something to all estimators that maps the column names appropriately.
Has anyone thought about the mechanics of how to pass the names around, from transformer to transformer, and perhaps how to track the provenance? Where would one store this?

A friend of mine, @cbrummitt, has a similar problem, where each column of his design matrix is a functional form (e.g. x^2, x^3, x_1^3 x_2^2, represented as sympy expressions), and he has transformers that act similarly to PolynomialFeatures, which can take in functional forms and generate more of them. But he's using sympy to take the old expressions and generate new ones; storing the expressions as string labels doesn't cut it, and gets complicated when you layer the function transformations. He could do all this outside the pipeline, but then he doesn't get the benefit of GridSearch, etc.

I guess the more general version of our question is: how do you have some information that gets passed from transformer to transformer that is NOT the data itself? I can't come up with a great way without having pipeline-global state, or having each transformer/estimator know about the previous ones, or having each step return multiple things. We also came up with the idea of modifying Pipeline to keep track of this; you'd have to change _fit() and _transform() and perhaps a few other things. That seems like our best option.

This sounds crazy, but what it feels like is that we really want our data matrix to be sympy expressions, and each transformation generates new expressions? That's gross, check_array() stops it from happening, and it'd make other steps down the pipeline angry.
See #6425 for the current idea.
All you want is a mapping, for each transformer (including a pipeline of …
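In today's scikit-learn this kind of per-transformer mapping ended up being exposed as `get_feature_names_out` (available on most transformers in scikit-learn >= 1.1). A rough sketch of chaining such a mapping through a pipeline, purely for illustration of the idea discussed here:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2))
pipe.fit(np.random.rand(10, 2))

# chain the name mapping step by step through the pipeline
names = np.array(["x0", "x1"])
for _, step in pipe.steps:
    names = step.get_feature_names_out(names)

print(names)  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
```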
We'll look into this, thank you!
Can someone provide a general update on the state of the world wrt this issue? Will pandas …
Right now, I'm not sure of an approach other than "just try it and see if it throws exceptions". We've tested a handful of hand-coded examples that seem to work just fine accepting a pandas DataFrame, but can't help thinking this will inevitably stop working right when we decide we need to make a seemingly trivial pipeline component swap... at which point everything falls down like a house of cards in a cryptic stack trace. My initial thought process was to create a replacement pipeline object that can consume a pandas …

I've been following a few different PRs, but it's hard to get a sense for which ones are abandoned and/or which reflect the current thinking:
This hinges critically on what you mean by "can safely consume a pandas DataFrame". If you mean a DataFrame containing only float numbers, we guarantee that everything will work. If there is even a single string anywhere, nothing will work. I think any scikit-learn estimator returning a dataframe for any non-trivial (or maybe even trivial) operation is something that might never happen (though I would like it to).
The functionality of #6425 is currently implemented (for some transformers and extensible to others) via singledispatch in https://codecov.io/gh/TeamHG-Memex/eli5, for what it's worth.

> On 21 June 2017, Andreas Mueller wrote:
> #9012 will happen and will become stable, the PR is a first iteration.
> #6425 is likely to happen, though it is not entirely related to pandas.
> #3886 is indeed superseded by #9012.
Maybe I am missing something, but wouldn't it be possible to unwrap the DataFrame (using `df.values`), do the computations, and then wrap back into a new DataFrame? That is basically what I do manually between steps, and the only thing preventing the use of a …
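For reference, the manual pattern being described is something like the sketch below; it only works when every column is numeric and the transform preserves the column layout:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

values = df.values                               # unwrap to a numpy array
scaled = StandardScaler().fit_transform(values)  # plain ndarray computation
df_scaled = pd.DataFrame(scaled, index=df.index, columns=df.columns)  # wrap back
```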
> Maybe I am missing something, but wouldn't it be possible to unwrap the DataFrame (using df.values), do the computations and then wrap back to a new DataFrame?

In general no: it might not work (heterogeneous columns), and it will lead to a lot of memory copies.
I think that Column Transformers and such can handle it individually.

I understand that there are difficult design and implementation choices to make, and that is a sound argument. However, I don't understand why you would argue that it is not a good idea to improve the way sklearn supports column metadata. Allowing, for instance, to ingest a df with features, add a column thanks to a predictor, do more data manipulations, do another predict, all of that in a Pipeline, is something that would be useful because it would (for instance) allow hyperparameter optimization in a much better integrated and elegant way. Doing it with pandas is just a suggestion, since it is the most common, easy and popular way to manipulate data, and I don't see any benefit in rewriting more than what they did. It would be up to the user to decide not to use this workflow when optimizing for performance.
Leaving things up to the user to decide requires clearly explaining the choice to the user. Most users do not read the documentation that would explain such choices. Many would try what they think might work, and then give up when they find it slow, not realising that it was their choice of dataframe that made it so.

So we need to step with some care here. But we do need to keep solving this as a high priority.
I think the best solution would be to support pandas dataframes in and out for the sample and feature properties, and to properly pass and slice them into train and test fit/transform. That would solve most use cases while keeping the speed of the data matrix X as numpy arrays.
One important point missing from these arguments is that pandas is moving towards a columnar representation of the data, in a way that …
I hope I was clear in my previous post. I believe scikit-learn doesn't currently need to support pandas dataframes for X data; keep them as speedy numpy arrays. But what would solve many use cases is full support throughout the framework for pandas dataframes for metadata, i.e. sample properties and feature properties. This shouldn't be a performance burden, even for memory copies, as these two data structures will be minor compared to X, and really only subsetting will be done on them.
Yes, those changes do help in many use cases, and we're working on them. But this issue is beyond that: #5523 (comment)
@hermidalc are you suggesting we let …
Yes, full support for sample properties and feature properties as pandas dataframes. Discussion is already happening on sample properties and feature names in other PRs and issues, e.g. here #9566 and #14315
I've read up on this issue and it looks like there are two major blockers here: …

Have you considered adding support for xarrays instead? They don't have those limitations of pandas.

```python
import numpy as np
import xarray as xr

X = np.arange(10).reshape(5, 2)
assert np.asarray(xr.DataArray(X)) is X
assert np.asarray(xr.Dataset({"data": (("samples", "features"), X)}).data).base is X.base
```

There is a package called …
xarray is actively being considered. It is being prototyped and worked on here: #16772. There is a usage notebook in the PR showing what the API would look like. (I will get back to it after we finish with the 0.23 release.)
I am also very interested in this feature. So instead of importing from …

Below is the implementation of this module. Try it out and let me know what you guys think.
koaning/scikit-lego#304 provided another solution, by hot-fixing sklearn.pipeline.FeatureUnion.
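The idea there is roughly a FeatureUnion that re-wraps its output as a DataFrame. A minimal sketch of that approach (not scikit-lego's actual code; it assumes a recent scikit-learn where `FeatureUnion.get_feature_names_out` exists, all sub-transformers implement it, and the stacked output is dense):

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion

class PandasFeatureUnion(FeatureUnion):
    """FeatureUnion variant that returns a DataFrame instead of an ndarray."""

    def transform(self, X):
        Xt = super().transform(X)
        index = X.index if isinstance(X, pd.DataFrame) else None
        return pd.DataFrame(Xt, index=index, columns=self.get_feature_names_out())

    def fit_transform(self, X, y=None, **fit_params):
        Xt = super().fit_transform(X, y, **fit_params)
        index = X.index if isinstance(X, pd.DataFrame) else None
        return pd.DataFrame(Xt, index=index, columns=self.get_feature_names_out())
```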
I like the solution with …
Mentioning #23001 here, because this is the most popular issue on the topic.
I guess this can be closed with #23734.
@premopie @183amir @avm19 @gioxc88 @naught101 it would be awesome to get your feedback on the feature we implemented and whether it addresses your use-cases.
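For anyone finding this thread later: the feature implemented in #23734 is the `set_output` API (scikit-learn >= 1.2), which lets transformers and pipelines return DataFrames directly. A quick example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)

pipe = make_pipeline(StandardScaler(), SelectKBest(k=2))
pipe.set_output(transform="pandas")  # every step now returns a DataFrame

Xt = pipe.fit_transform(X, y)
print(type(Xt))    # <class 'pandas.core.frame.DataFrame'>
print(Xt.columns)  # the names of the two selected input features
```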
I forgot what the serious use-case I had was, but this feature works in a toy example. Here are the changes I made. So yeah, it made me a bit happier :) @amueller
At the moment, it's possible to use a pandas dataframe as an input for most sklearn fit/predict/transform methods, but you get a numpy array out. It would be really nice to be able to get data out in the same format you put it in.
This isn't perfectly straightforward, because if your DataFrame contains columns that aren't numeric, then the intermediate numpy arrays will cause sklearn to fail, because they will be `dtype=object` instead of `dtype=float`. This can be solved by having a DataFrame->ndarray transformer that maps the non-numeric data to numeric data (e.g. integers representing classes/categories). sklearn-pandas already does this, although it currently doesn't have an `inverse_transform`, but that shouldn't be hard to add.

I feel like a transform like this would be really useful to have in sklearn - it's the kind of thing that anyone working with datasets with multiple data types would find useful. What would it take to get something like this into sklearn?
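As a rough illustration of the kind of transformer described here (a sketch, not sklearn-pandas' actual API): encode non-numeric columns as integer codes on the way in, and reverse the mapping in `inverse_transform`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameToArray(BaseEstimator, TransformerMixin):
    """Encode the non-numeric columns of a DataFrame as integer codes so the
    result is a plain float ndarray, and support mapping back to a DataFrame."""

    def fit(self, X, y=None):
        self.columns_ = list(X.columns)
        # remember the categories of every non-numeric column
        self.categories_ = {
            col: pd.Categorical(X[col]).categories
            for col in X.columns
            if not pd.api.types.is_numeric_dtype(X[col])
        }
        return self

    def transform(self, X):
        out = X.copy()
        for col, cats in self.categories_.items():
            out[col] = pd.Categorical(out[col], categories=cats).codes
        return out.to_numpy(dtype=float)

    def inverse_transform(self, X):
        df = pd.DataFrame(X, columns=self.columns_)
        for col, cats in self.categories_.items():
            df[col] = cats[df[col].astype(int).to_numpy()]
        return df
```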