
SLEP 8: Propagating feature names #18

Open · wants to merge 3 commits into base: master

Conversation

jorisvandenbossche (Member) commented May 28, 2019

With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.

The PR implementing it is scikit-learn/scikit-learn#13307

gets recursively called on each step of a ``Pipeline`` so that the feature
names get propagated throughout the full ``Pipeline``. This will allow
inspecting the input and output feature names in each step of a ``Pipeline``.

adrinjalali (Member) commented May 31, 2019

We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names. That's slightly different than recursively calling it to get the names once the pipeline has been fit.

amueller (Member) commented May 31, 2019

Yes, we should mention that, but maybe you can provide a suggestion for the motivation and implementation?

jorisvandenbossche (Member, Author) commented Jul 14, 2019

We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names.

That's maybe partly related to what I mentioned below in one of the questions about standalone estimators (not in a pipeline). If we want those to behave similarly, the fit method of the estimator needs to do something (at least, with the current proposal, calling the "update feature names" method). But if we actually let fit handle the actual feature name logic (needed for the above suggestion), that directly solves the issue of standalone vs within-pipeline consistency.

potentially removing the need to have an explicit output feature names *getter
method*. The "update feature names" method would then mainly be used for
setting the input features and making sure they get propagated.

adrinjalali (Member) commented May 31, 2019

+1 for having them [almost] everywhere.

standing estimators and Pipelines. However, the clear downside of this
consistency is that this would add one line to each ``fit`` method throughout
scikit-learn.

adrinjalali (Member) commented May 31, 2019

there's also the option of having a fit which does some common tasks, such as setting the feature names, and letting the child classes only implement _fit. It kinda goes along the lines of what's being done in scikit-learn/scikit-learn#13603
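
A minimal sketch of that fit/_fit split (names hypothetical; this is not the actual scikit-learn base class):

from abc import ABC, abstractmethod

class BaseTransformer(ABC):

    def fit(self, X, y=None):
        # Common bookkeeping lives in the base class: record the input
        # feature names if X carries them (e.g. DataFrame columns).
        columns = getattr(X, "columns", None)
        self.input_feature_names_ = None if columns is None else list(columns)
        # The estimator-specific work is delegated to the subclass.
        return self._fit(X, y)

    @abstractmethod
    def _fit(self, X, y=None):
        """Subclasses implement only the estimator-specific fitting."""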

adrinjalali (Member) commented May 31, 2019

There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).

amueller (Member) commented May 31, 2019

There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).

can you elaborate? I don't remember that part.

adrinjalali (Member) commented May 31, 2019

can you elaborate? I don't remember that part.

I think it was specifically in the context of NLP-related use cases where the whole "dictionary" becomes the features and it may be very memory-intensive to store them. IIRC @jnothman raised the concern.

jnothman (Member) commented Jun 3, 2019

amueller (Member) left a comment

I think the SLEP needs a list of use cases, in particular comparing the pandas and non-pandas ones and checking whether there are other relevant cases.
Do we ever actually change feature names that have been set? Maybe to simplify them?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The core idea of this proposal is that all transformers get a transformative
*"update feature names"* method that can determine the output feature names,

amueller (Member) commented Jun 26, 2019

Maybe say that the name is up for discussion?

The ``Pipeline`` and all meta-estimators implement this method by calling it
recursively on all their child steps / sub-estimators, and in this way the
input feature names get propagated through the full pipeline. In addition, it
sets the ``input_feature_names_`` attribute on each step of the pipeline.

amueller (Member) commented Jun 26, 2019

Maybe explain why that is necessary?
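
For concreteness, a minimal sketch of the recursive propagation the excerpt above describes, assuming a transformer-level get_feature_names(input_features) along the lines of the linked PR (all names are placeholders, not a settled API):

def update_feature_names(pipeline, input_features):
    # Each step records the names it receives and returns the names it
    # produces, which become the input names of the next step.
    names = input_features
    for _, step in pipeline.steps:
        step.input_feature_names_ = names
        if hasattr(step, "steps"):  # nested Pipeline: recurse
            names = update_feature_names(step, names)
        else:
            names = step.get_feature_names(names)
    return names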

feature names get propagated through the full pipeline. In addition, it sets
the ``input_feature_names_`` attribute on each step of the pipeline.

A Pipeline calls the method at the end of ``fit()`` (using the DataFrame column

amueller (Member) commented Jun 26, 2019

As @adrinjalali says, there is also the possibility to set them during fit.

- Transformers based on arbitrary functions


Should all estimators (so including regressors, classifiers, ...) have an "update feature names" method?

amueller (Member) commented Jun 26, 2019

I think this method is not properly motivated in the SLEP.
The use case is that X has no column names, right? We fitted a pipeline on a numpy array and we also have feature names and now we want to get the output features.

It might make sense to distinguish the cases where X contains the feature names in the column and where it doesn't because in the first case everything can be automatic.


For consistency, we could also add them to *all* estimators.

For a regressor or classifier, the method could set the ``input_feature_names_``

amueller (Member) commented Jun 26, 2019

Why? You mean the output feature names, right?

jorisvandenbossche (Member, Author) commented Jul 14, 2019

Regressors don't have output features?

But in general, I am not fully sure anymore what I was thinking here for this section. It all depends on where we decide the responsibility lies for setting the attributes (does the parent pipeline set the attributes and the "update feature names" method then looks first at the attribute, or does the parent pipeline pass the names to the "update feature names" method, which then sets the attribute, or ...)

----------------------

This SLEP does not affect backward compatibility, as all described attributes
and methods would be new ones, not affecting existing ones.

amueller (Member) commented Jun 26, 2019

Well, if we reuse get_feature_names, then we add a new parameter in some cases, but the old behavior still works.

with a set of custom input feature names that are not identical to the original
DataFrame column names, the column names stored for validation and the column
names stored for propagating the feature names would get out of sync. Or should
calling ``get_feature_names`` also affect future validation in a ``predict()``

amueller (Member) commented Jun 26, 2019

I would vote for this option.

amueller (Member) commented Jun 26, 2019

There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?

transformative "update feature names" method and calling it recursively in the
Pipeline setting the ``input_feature_names_`` attribute on each step.

1. Only implement the "update feature names" method and require the user to

amueller (Member) commented Jul 2, 2019

I'm not sure if this is the correct distinction but the main point is to always just operate on output feature names, never on input feature names, right?

jorisvandenbossche (Member, Author) commented Jul 14, 2019

I don't fully understand this comment. What do you mean by "operating on output feature names"?

(this alternative of course depends on the idea of having such a "update feature names" method that does the work, but if we decide that actually it should happen in fit that would change things)

amueller (Member) commented Jul 2, 2019

Coming back to this, I feel like I now favor a more invasive approach.
What I'm mostly thinking about right now is how feature names can enter an estimator. I feel that if they enter any way other than fit, the user might call fit with a mismatching X and names and the results will be inconsistent.
So there would be a benefit in providing feature names only via fit. If X is a dataframe, that's easy. If X is not a dataframe, I can see two options, both of which are very invasive:
a) add a feature_names_in parameter to fit (everywhere)
b) create an ndarray subclass that has a feature_names attribute that stores the feature names.

a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing.

If we always have the feature names in fit, we can also do the transformation in fit, and so the user never has to call any method.

I feel that it would be good if we can eliminate having a public method. That means if you need to change your feature names, you have to refit again. An expert might use private methods to prevent this, but I think it's not that important a use-case.
I think it's more important to ensure that feature names and actual computation are in sync.

Therefore my preferred solution right now:

  1. Ensure feature_names_in is available during fit
  2. Set feature_names_out at the end of fit
  3. have the pipeline pass the out from the previous step to the in from the next step
  4. profit

The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class.

Given that 2) requires touching every fit (of a transformer) anyway, maybe passing it in as an argument is easiest? I'm unsure about adding a custom class, but I don't really like the "set it before fit" that is currently implemented because it's very implicit and side-effect-y.
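
A rough sketch of the numbered flow above, with feature_names_in as a hypothetical fit parameter (the "passing it in as an argument" option) and feature_names_out_ as a hypothetical fitted attribute:

def fit_pipeline(steps, X, y, feature_names):
    Xt, names = X, feature_names
    for step in steps[:-1]:
        step.fit(Xt, y, feature_names_in=names)  # 1. names available during fit
        names = step.feature_names_out_          # 2. set at the end of fit
        Xt = step.transform(Xt)                  # 3. out of one step feeds in of the next
    steps[-1].fit(Xt, y, feature_names_in=names)
    return Xt, names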

amueller (Member) commented Jul 2, 2019

The downside of passing the feature names into fit as separate argument is that it requires special-casing passthrough in the pipeline which is a mess :-/ but well..

jnothman (Member) commented Jul 2, 2019

amueller (Member) commented Jul 3, 2019

@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.

That would be kind of "pandas in"-"pandas out" then, but only applied to X - which would allow a lot of things already.

Yes the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)

Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs they should assume they are sample aligned, so this will break backward-compatibility.

Which might show that allowing kwargs wasn't / isn't a good way to do sample props?
In a way passing things through fit is basically "attaching properties to features/columns", only it looks to me as if the routing questions are much easier (or just different) than with sample props.

We could go even a step further and do feature_props={'names': feature_names}, but for now that seems like unnecessary indirection.

I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function make_named_array(X, feature_names) and that creates an object of type NamedNdarray that has an attribute feature_names and we expect the user to do that if they don't want to use pandas as input, and we wrap the output of transformers with that.
It's basically the same as pandas-in, pandas-out, only that we ensure there's really zero copy and it's future-proof as long as we rely on numpy.

BTW any of these designs get rid of the weird subestimator discovery I had to do, because now everything is done in fit, as it was meant to be ;)
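
A minimal sketch of that wrapper, using the make_named_array / NamedNdarray names floated above (neither exists in scikit-learn):

import numpy as np

class NamedNdarray:
    """An array plus feature names; np.asarray makes this zero copy
    when the input is already a numeric ndarray."""

    def __init__(self, data, feature_names):
        self.data = np.asarray(data)
        self.feature_names = list(feature_names)

def make_named_array(X, feature_names):
    # Wrap X unless it already carries names.
    return X if isinstance(X, NamedNdarray) else NamedNdarray(X, feature_names)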

amueller (Member) commented Jul 3, 2019

Asked pandas: pandas-dev/pandas#27211

amueller (Member) commented Jul 3, 2019

Also asked xarray:
pydata/xarray#3077

I think the answer from pandas is as I remembered, which is that they might change to using 1d slices, and while conversion to pandas might be possible, coming back is not possible without a copy.

It seems a bit unnatural to me if we'd produce DataArrays if someone passes in a DataFrame but if DataFrame is not an option then we should consider it.

amueller (Member) commented Jul 3, 2019

@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.
Also, I assume it will require a lot of code for many of the estimators as pandas dataframes and numpy arrays likely require different codepaths even for simple things like scaling.

tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.

adrinjalali (Member) commented Jul 5, 2019

If we go down the path of having a NamedNdarray, then why not keep the sample props attached right to the array as well? If we do that, it starts getting closer to xarray's implementation, so to me it seems like a better solution to just support/use xarray.DataArray. (They even have a Dataset object, BTW.)

@jnothman do you see major drawbacks to using an xarray.DataArray-like object to pass around sample props? I know it'd be tricky to handle the case where we want a scorer or an estimator to ignore a sample property that is present, but that can be handled by a meta-estimator removing those attributes before feeding the data to the estimator.

@amueller are you already working on a NamedNdarray like solution?

amueller (Member) commented Jul 5, 2019

Stephan said in pydata/xarray#3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.

@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use sample_weights etc? So there's definitely some benefit but the hard problems are not actually addressed :-/

I haven't started implementing a NamedNDarray solution and I think right now I would prefer xarray.DataArray. I'm still not convinced that fit arguments are bad, though they have backward compatibility issues as @jnothman pointed out.

Right now, ColumnTransformer works on xarray.DataArray if we call the column dimension columns. Is that something we would want to enforce? That would allow us to write similar code to support pandas and xarray, but it might be a bit odd from an xarray perspective? From an sklearn view it would probably make the most sense to call the two dimensions samples and features, but then we would need to do more duck-typing for xarray vs pandas.
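
For illustration, what such a DataArray would look like with the column dimension literally named "columns" (made-up data and names):

import numpy as np
import xarray as xr

X = xr.DataArray(
    np.random.rand(3, 2),
    dims=("index", "columns"),
    coords={"columns": ["age", "income"]},
)
age = X.sel(columns="age")  # label-based column access, analogous to df["age"]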

amueller (Member) commented Jul 16, 2019

https://github.com/BIDS/datarray#prior-art seems an interesting list.
larry seems closest to what we want, though it is unmaintained.

Also interesting is Table from data8, but it's a pure column store: https://github.com/data-8/datascience/blob/master/datascience/tables.py

jorisvandenbossche (Member, Author) commented Jul 16, 2019

I'm not sure if we should also take @jorisvandenbossche's proposal to not coerce dataframes to numpy arrays in preprocessing into account here, though it's somewhat orthogonal.

I think that it is something worth exploring in itself, but will never be a complete solution for this discussion, so I would keep it separate.

amueller (Member) commented Jul 16, 2019

Fun question for our custom subclass: does our class support nd-arrays? Probably, right? Or will it revert to being a numpy array when you reshape it?

I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.

shoyer commented Jul 16, 2019

In my opinion, this structure should behave as closely as possible to an ndarray, because that is really the model of scikit-learn. This is unlike pandas (and I believe xarray), which has indices and does alignment with them, and also has different indexing rules.

The biggest difference between xarray.DataArray and numpy.ndarray is how we do broadcasting: broadcasting is done by dimension name instead of by integer position. This is arguably much more intuitive, but definitely means you cannot simply substitute NumPy -> Xarray and expect everything to work.

Otherwise, xarray's API is very similar to NumPy, much more so than pandas:

  • Indexing like array[] works like NumPy, with the exception of broadcasting for vectorized indexing.
  • If you add coordinate labels, you also get automatic alignment, but coordinate labels are entirely optional.
  • One difference is that aggregations like array.mean() default to NaN skipping versions.

I think there is potentially room in the ecosystem for a simpler labeled array project, e.g., one with only labeled axes/dimensions but not tick/coordinate labels. The challenge is deciding where to draw the line. My experience has been that there is a fundamental trade-off between compatibility with unlabeled array APIs (NumPy) and label-based expressiveness.

Xarray chose to go "all in" on label-based expressiveness. This makes it too heavy-weight for some use-cases, but it does provide strong rewards for specifying labels, in the form of a large (and growing) label powered API.

For return values from scikit-learn estimators, I like the light-weight model of patsy.DesignMatrix, which is just a numpy array plus metadata that gets lost in any operation:

>>> x = dmatrix("x1 + x2", data)
>>> x
DesignMatrix with shape (8, 3)
  Intercept        x1        x2
          1   1.76405  -0.10322
          1   0.40016   0.41060
          1   0.97874   0.14404
          1   2.24089   1.45427
          1   1.86756   0.76104
          1  -0.97728   0.12168
          1   0.95009   0.44386
          1  -0.15136   0.33367
  Terms:
    'Intercept' (column 0)
    'x1' (column 1)
    'x2' (column 2)

>>> x + 1
array([[2.        , 2.76405235, 0.89678115],
       [2.        , 1.40015721, 1.4105985 ],
       [2.        , 1.97873798, 1.14404357],
       [2.        , 3.2408932 , 2.45427351],
       [2.        , 2.86755799, 1.76103773],
       [2.        , 0.02272212, 1.12167502],
       [2.        , 1.95008842, 1.44386323],
       [2.        , 0.84864279, 1.33367433]])
shoyer commented Jul 16, 2019

I think I have to look into the numpy internals way more to understand where we would need to hook in to make sure we update or error if the shape is changed.

I'm not sure this is possible. NumPy has never had a complete subclassing API, which is part of the reason why classes like np.matrix have never worked particularly well. In general, we (NumPy developers) have been trying to discourage writing ndarray subclasses in favor of duck arrays. These are a lot more work to implement but they rule out the possibility of any unexpected interactions with NumPy's internals.

amueller (Member) commented Jul 17, 2019

@shoyer it actually looks quite simple to enforce column consistency with __array_finalize__ unless I'm missing something: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.subclassing.html

Looks like DesignMatrix is a subclass.

If it's discouraged, we can obviously do the duck array.
The easiest way to do that would be to just implement __array__; everything would be working, but everything would lose the meta-data, right?

I'd rather have scalar operations or row-wise operations keep the column names. Basically, doing zero mean, unit variance should keep the column names.
That would actually make implementing the forwarding of column names much easier for us as well. Concatenating and slicing columns should probably also keep the column names.
I hadn't thought of these before, but not having them will be a pain for users, and for us. If we do this right, the feature name propagation in our preprocessing could be somewhat automatic.
I think I'm gonna prototype this; probably trying to start with duck arrays.
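
A minimal sketch of the subclassing route under discussion (hypothetical class; see the caveats about __array_finalize__ in the replies below):

import numpy as np

class NamedArray(np.ndarray):

    def __new__(cls, input_array, feature_names=None):
        obj = np.asarray(input_array).view(cls)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return  # constructed explicitly via __new__
        names = getattr(obj, "feature_names", None)
        # Keep the parent's names only while the column count still
        # matches; reshapes and reductions silently drop them.
        if names is not None and self.ndim == 2 and self.shape[1] == len(names):
            self.feature_names = names
        else:
            self.feature_names = None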

shoyer commented Jul 17, 2019

it actually looks quite simple to enforce column consistency with __array_finalize__ unless I'm missing something

__array_finalize__ is called in most but not all cases. Probably the most notable exception is np.concatenate, which silently casts subclasses to base ndarrays.

A bigger issue (and there are others) is that __array_finalize__ doesn't know anything about the operation that was performed. For example, if you transpose an array, all you know is what the final shape is. So if you have a square matrix, you might blindly copy axis labels to the result even if they should be swapped.

You could fix this particular issue by implementing your own transpose method (and/or np.transpose itself with __array_function__), but it should give you a flavor of what it's like to implement a subclass: everything in NumPy "works" but not necessarily in the right way. Even if it does work right, it's not necessarily for the right reason -- it is easy to inadvertently rely on implementation details in a subclass.

👍 for testing this out with a duck array.

The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?

Yes, that would make every NumPy function work.

amueller (Member) commented Jul 17, 2019

Thinking about it more, I came to the same conclusion; thank you for your input! __array_finalize__ indeed seems hard to make work "right".

adrinjalali (Member) commented Jul 17, 2019

Answering some comments from scikit-learn/scikit-learn#14315 here, as it seems more appropriate:

@thomasjpfan

We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html

I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum supported numpy version meets those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?

With a custom NamedArray, we can have "namedarray in -> namedarray out" for every estimator.

That is what I'm trying to achieve in this PR (at least to give users a method so that they can achieve that).

@amueller

we can actually make it "just work" for many cases and don't need to implement get_feature_names in those cases!!!

Do you mean to keep the feature names there, and if the transformer doesn't do anything to mess with the columns, to keep the column names? I think it's possible, but at the end of the transformer we still need to check whether the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure it's worth the effort.

what were the issues with subclassing?

it's not straightforward at all, and there are at least two different ways of doing it, as @shoyer mentioned. I also checked some other implementations, and they're not happy with their implementations either (the new duck typing method seems much better, though it is more work).

With subclassing, I did manage to make it work as long as only one side of the operation is a NamedArray (when the two NamedArrays have mismatching column names it gets complicated), and it'd keep the column names for simple operations and slices. Also, that'd mean the slice operation is much slower since we're hooking into __getitem__, and we'd need to expose something like iloc to do fast slicing on the array itself, which means the rest of the codebase would need to use it. In the end, it seemed like what I put in this PR now (the very simple container thingy) would be a more reasonable solution maybe.

The easiest way to do that would be to just implement __array__ and everything would be working but everything would lose the meta-data, right?

That's what I did in https://github.com/scikit-learn/scikit-learn/pull/14315/files#diff-5b184308506279093e1d9b367d5a8643 , and almost all common tests now pass with it. Still dealing with some FeatureUnion issues.

shoyer commented Jul 17, 2019

We can duck type this: https://www.numpy.org/devdocs/user/basics.dispatch.html

I did go through that page as well, but if you read the NEP, you see that it's pretty new. I don't think our minimum numpy requirements meet those requirements at all. We basically need to be on 1.17 for it to work, I think. @shoyer am I correct?

If you're not willing to commit to NumPy 1.17+, then it isn't possible to make either a duck-array or subclass that is preserved by all NumPy operations. Operations like np.concatenate will always coerce your objects into base NumPy arrays.

This is part of why xarray implements its own functions/methods, e.g., xarray.where vs numpy.where. The other reason is that in some cases we have intentionally chosen to implement different semantics, e.g., xarray.dot automatically sums over axes with the same name, unlike numpy.dot, which sums along the last axis of the first argument and the second-to-last axis of the second argument.
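
For reference, a minimal sketch of the duck-array route under NEP 18 (requires NumPy >= 1.17 for __array_function__; all names here are hypothetical):

import numpy as np

class NamedDuckArray:

    def __init__(self, data, feature_names):
        self._data = np.asarray(data)
        self.feature_names = list(feature_names)

    def __array__(self, dtype=None):
        # Fallback coercion: NumPy sees a plain ndarray, names are lost.
        return np.asarray(self._data, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        # Dispatch to plain NumPy on the unwrapped data; deciding when the
        # names survive an operation (and re-wrapping) is the hard part.
        unwrap = lambda a: a._data if isinstance(a, NamedDuckArray) else a
        return func(*(unwrap(a) for a in args), **kwargs)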

amueller (Member) commented Jul 17, 2019

I think it's possible, but still at the end of the transformer we need to check if the column names are still there or not, since with many operations the column names should rightfully be gone. That's why I'm not sure if it's worth the effort.

Why not? I think it's weird and error-prone to have to reimplement what happens to the columns during the preprocessing. Having the same code handle the actual transformation and the transformation of the names seems way more straightforward.
Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names in many cases, I think.

Also, that'd mean the slice operation is much slower since we're hooking into __getitem__, and we'd need to expose something like iloc or something to do fast slicing on the array itself, which means the rest of the codebase would need to use it.

I think with the new duck-typing interface we could do that without using __getitem__ or iloc, but if we can't use that, it's going to be hard.

amueller (Member) commented Jul 17, 2019

Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)

adrinjalali (Member) commented Jul 18, 2019

Yes, we'd still need to do something at the end of fit, but we'd save implementing get_feature_names for many cases, I think.

I'm not sure if I understand correctly. ATM, feature_names_out_ is the same as the given input feature names if it's a one-to-one mapping, and the output would also include those feature names.

If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.

What I was trying to say is that we can have a simple NamedArray implementation, keep it experimental, have the pipeline and transformers work with the feature names, and then have a more complicated NamedArray if we want.

I guess at this point I'd appreciate your feedback on scikit-learn/scikit-learn#14315 and to know if you think it's a right direction or not.

GaelVaroquaux (Member) commented Jul 18, 2019

amueller (Member) commented Jul 18, 2019

If the output feature names are not supposed to be the same as the input feature names, I'm not sure how you'd have the NamedArray itself generate the new column names.

SimpleImputer and feature selectors drop columns, so they could work easily. Even if there's a one-to-one correspondence, we need to reattach the feature names to the output if we want to hand them on.

I'm happy to start with a simple solution and iterate from there, though. I agree with @GaelVaroquaux that preserving the feature names is not a problem we need to fix; it would just be nice, and I hadn't realized earlier that it's a possibility.

@adrinjalali I'll check out your proposal tomorrow or early next week :)

jorisvandenbossche (Member, Author) commented Jul 21, 2019

BTW, I am offline for a week, but feel free to push updates with a description of the alternatives to this branch.

NicolasHug (Contributor) left a comment

Made a few comments.

What's the status of this SLEP? Is it blocked by #22?

Note: this only works if the Imputer didn't drop any all-NaN columns.

This SLEP proposes that the input features derived on fit time are propagated

NicolasHug (Contributor) commented Oct 8, 2019

Suggested change:
- This SLEP proposes that the input features derived on fit time are propagated
+ This SLEP proposes that the input features derived at fit time are propagated
- set the ``input_feature_names_`` attribute (only done by ``Pipeline``
and ``MetaEstimator``)
Comment on lines +95 to +96

NicolasHug (Contributor) commented Oct 8, 2019

Why would this only be done by pipelines? Why not for a single transformer?

NicolasHug (Contributor) commented Oct 8, 2019

Also I think we should indicate that pipelines will also set the attribute for the last step (which might not be a transformer)

Backward compatibility
----------------------

This SLEP does not affect backward compatibility, as all described attributes

NicolasHug (Contributor) commented Oct 8, 2019

If we add common checks, there might be some "backward compatibility issues" much like for #22. Maybe that's just something to note.

feature names get propagated through the full pipeline. In addition, it sets
the ``input_feature_names_`` attribute on each step of the pipeline.

A Pipeline calls the method at the end of ``fit()`` (using the DataFrame column

NicolasHug (Contributor) commented Oct 8, 2019

Suggested change:
- A Pipeline calls the method at the end of ``fit()``
+ A Pipeline calls the method at the end of its fit

I wasn't sure at first whether this was referring to Pipeline.fit or to step.fit.

hermidalc commented Oct 27, 2019

My suggestion, and sorry I didn't read every comment here, is that passing feature names isn't enough; there are plenty of use cases where you want multiple feature metadata, not just the name. Since this is an enhancement, we might as well go for more extensive feature metadata support, because I don't believe it's more difficult logic-wise. As discussed in other threads, why not just pass around a feature properties array-like (DataFrame or numpy) and have it subsetted with each feature selection step in the Pipeline and available in fit methods, just like sample properties can be via **fit_params?

GaelVaroquaux (Member) commented Oct 27, 2019

shoyer commented Oct 27, 2019

Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.

I think the requirement is actually NumPy 1.17 for __array_function__?

GaelVaroquaux (Member) commented Oct 27, 2019

adrinjalali (Member) commented Oct 28, 2019

I'm not sure how using __array_function__ would hamper people's access to the numerics, @GaelVaroquaux. People can still rely on an older sklearn if they really want to use an old numpy. Besides, if we're talking about v1.0, we should be able to think about such changes.

jorisvandenbossche (Member, Author) commented Oct 28, 2019

What's the status of this SLEP? Is it blocked by #22?

I don't think it is directly blocked by #22 (n_features_in_) (there are a few overlapping aspects, e.g. regarding naming, but I think that is not touching the essential discussion here).

I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).

adrinjalali (Member) commented Oct 28, 2019

I think the alternative which we mostly agree with is the one proposed in scikit-learn/scikit-learn#14315. We either need a new SLEP, or to update this one to reflect the ideas there.

GaelVaroquaux (Member) commented Oct 28, 2019

amueller (Member) commented Nov 6, 2019

@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and that this config would fail with older numpy (i.e. the config option would have a soft dependency on numpy 1.17)
I would definitely not trigger an update.

If you have an idea how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.

amueller (Member) commented Nov 6, 2019

@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with **fit_params (see #16). Adding additional metadata later will probably be relatively straight-forward if/when we agree on a mechanism.
What do you do with the meta-data for PolynomialFeatures or PCA?

amueller (Member) commented Nov 6, 2019

I would suggest we write a separate new slep on the NamedArray. @adrinjalali do you want to do that? You could reuse parts of this one if appropriate.

GaelVaroquaux (Member) commented Nov 6, 2019

adrinjalali (Member) commented Nov 11, 2019

To be clear, the current implementation of NamedArray only requires numpy 1.13, which is what we'll have once we bump our python support to 3.6 anyway. So that's not a big issue. However, if we want to include more features in the NamedArray as @thomasjpfan has also suggested, then we'll need 1.17.

@amueller yes I'll write up a new SLEP.
